Theory and Algorithms for Forecasting Non-Stationary Time Series (NIPS 2016 tutorial)

Captions
[Host] Hello! Well, I'm very excited to start the tutorial. Our first speaker is Vitaly Kuznetsov. He's a research scientist at Google; he did his PhD at the Courant Institute of Mathematical Sciences, and his research interests are mostly in non-stationary time series analysis, both theory and applications. Without further ado.

[Vitaly Kuznetsov] Thank you so much for the introduction. Let me start by saying a few words about why we actually care about time series analysis in the first place: why are we here today? Time series data appears in many, many real-world applications these days: stock values, other economic variables, local temperatures, global climate trends. All of these often come in the form of time series data. In fact, I personally like to argue that all of the data we are collecting is time series data, precisely because we're collecting it over time. And if you look at this list, you'll see that in a lot of these applications it is critical to do time series analysis and time series forecasting accurately. If we're talking about stock values, there could be millions or billions at stake; if we're talking about climate, even the future of our planet.

But I also want to argue that time series analysis is not only a very important problem, it is also often a very challenging task. Let us look at some examples. You have probably heard about Google Trends. Here's a plot of two time series from Google Trends: interest over time in two different search terms, "machine learning" and "deep learning". When you look at these time series, maybe you see some patterns, perhaps a common trend component or other structure. But the initial intuition we get from these simple plots can be quite misleading. For instance, if I were to zoom in on an earlier segment of these time series, say 2012-2013, I would see a plot that looks very different from the original one. The two series also look quite different from each other: the machine learning series is more volatile and perhaps has a seasonal component, while the deep learning series is more stable and doesn't vary that much over time. On the other hand, if I progress further in time and look at the more recent 2015-2016 period, the picture changes once again: now both time series move in a much more volatile fashion, and, perhaps surprisingly, in some cases they even move in opposite directions, which is counterintuitive for these two search terms.

This kind of phenomenon, non-stationarity, makes time series analysis very challenging, and it also makes it very different from the usual machine learning we are used to. Typically in machine learning we expect some sort of i.i.d. assumption: we expect the training distribution to be the same as the test distribution, and we expect these distributions to be fixed and not change over time, which means they're stationary. This is not the case in time series analysis, and in general these assumptions are badly violated in such scenarios. So how can we analyze this kind of data? How can we learn from non-stationary time series data? These are the kinds of questions that we will try to investigate and answer in this tutorial.
The plan for today is as follows. First, I'm going to give a very quick, thirty-minute introduction to classical, traditional time series analysis. Then we're going to have a very short two-minute break, just to catch some breath, after which we're going to discuss some more recent learning theory developed for non-stationary time series. After the theory we're going to take a longer break, a whole four minutes, and then we'll switch gears and talk about algorithmic solutions based on the theory we present. Finally, we'll discuss some connections between time series prediction and a very hot topic these days, online learning.

So let us jump right in and talk about classical time series analysis. The way the classical framework operates is as follows. The first thing you typically do is postulate some parametric model that you assume is generating the data you observe. Once you have done this critical step, you take the observed data and use it to estimate the unknown parameters of this parametric model. Finally, once you have the estimated model, you use it for whatever purpose you have in mind, for instance prediction.

So what kinds of models are out there? If you think about what you're trying to do, you're trying to use past observations to make forecasts about the future. A very natural thing to do is to take the last p observations and combine them in a certain way, and let that be your forecast. That's precisely what so-called autoregressive (AR) models are: linear combinations of the past p observations y_{t-1}, y_{t-2}, ..., y_{t-p}, with weights a_1, a_2, ..., a_p, which we call the autoregressive coefficients. Once you form this linear combination of previous observations, you add a noise term ε_t, and you assume that this is the next observation in the sequence. These noise terms are uncorrelated, zero-mean random variables, and they're supposed to capture the uncertainty not captured by the linear combination of past observations.

At this point you may ask: why restrict yourself to the uncertainty at time t? Maybe the uncertainty at previous times also plays a role. If you start thinking along these lines, you arrive at the so-called moving average (MA) model, which assumes that the observation y_t is a combination of the uncertainty at time t, ε_t, and the uncertainties at previous times, ε_{t-1}, ε_{t-2}, ..., ε_{t-q}, combined with certain weights which we call the moving average coefficients. Combining these two models, the autoregressive model of the past observations and the moving average model of the noise, we arrive at a very classical and celebrated model: the ARMA model, the autoregressive moving average model. It was proposed, I think, back in the 1950s by Whittle, maybe even earlier, and it was popularized by Box and Jenkins in the 1970s.
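To make this concrete, here is a minimal sketch (added for illustration, not from the original slides) of simulating an ARMA(p, q) process with Gaussian noise:

```python
import numpy as np

def simulate_arma(a, b, T, seed=0):
    """Simulate y_t = sum_i a_i * y_{t-i} + eps_t + sum_j b_j * eps_{t-j},
    where eps_t are i.i.d. standard normal noise terms."""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    m = max(p, q)
    eps = rng.standard_normal(T + m)
    y = np.zeros(T + m)
    for t in range(m, T + m):
        ar = sum(a[i] * y[t - 1 - i] for i in range(p))
        ma = sum(b[j] * eps[t - 1 - j] for j in range(q))
        y[t] = ar + ma + eps[t]
    return y[m:]  # drop the burn-in prefix

# Example: a sample path from an ARMA(2, 1) process.
y = simulate_arma(a=[0.5, -0.25], b=[0.4], T=2000)
```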
So let us look at an example of a process distributed according to an ARMA model; here's one. It looks like a bunch of random noise, ups and downs, some kicks, but one interesting thing you may observe is the following: if you take two chunks of this time series, say an earlier chunk and a later chunk, then in a certain statistical sense you won't be able to distinguish them. This phenomenon is what's known as stationarity. More formally, it's defined exactly this way: if I take any chunk of my time series and shift it by k units, and the distribution of the original chunk and the distribution of the shifted chunk happen to be exactly the same, that's what we call stationarity. In fact, this is often referred to as strong stationarity. Weak stationarity, or second-moment stationarity, is a property that only requires the first and second moments of these distributions to match; more precisely, you want the mean of your stochastic process to be independent of time, and you want the covariance function of your stochastic process to be independent of time, meaning that the covariance of Z_t and Z_{t-j} depends only on the lag j, not on the exact time point t.

So is our original intuition about the ARMA process correct? Is it actually stationary? From the plot I showed, it seems that any two chunks look about the same, but we obviously need some tools for this analysis. The standard tool you should be familiar with if you want to do anything with these autoregressive models is the so-called lag operator. It's a very simple thing: if you're given an observation y_t and apply the lag operator L to it, you get back the previous observation y_{t-1}. Very simple, but with it you can rewrite the ARMA model we saw before in terms of lag operators. In particular, on the left-hand side you form the polynomial 1 - Σ_{i=1}^{p} a_i L^i in the lag operator and apply it to y_t, and on the right-hand side you form a similar polynomial in the lag operator and apply it to the noise term. If you work out the algebra and apply the definition of the lag operator, you arrive at exactly the formulation of the ARMA model from a few slides before.

The reason it's nice to write the ARMA model this way is that it exposes interesting properties of the model in terms of the lag operator and the special polynomial on the left-hand side. We're going to call this polynomial the characteristic polynomial: P(z) = 1 - Σ_{i=1}^{p} a_i z^i. It precisely characterizes the stationarity properties of ARMA processes. In particular, a very standard result says the following: if all the roots of the characteristic polynomial P lie outside the unit circle, then the ARMA(p, q) process is strictly stationary. This is actually a very simple result, and its proof is also rather simple; it's basically applying what you know about the roots of polynomials.
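As an aside (an added sketch, not from the slides), this root condition is easy to check numerically for a given set of autoregressive coefficients:

```python
import numpy as np

def ar_part_is_stationary(a):
    """Check whether all roots of the characteristic polynomial
    P(z) = 1 - a_1 z - ... - a_p z^p lie outside the unit circle."""
    # np.roots takes coefficients from the highest degree down:
    # [-a_p, ..., -a_1, 1] encodes -a_p z^p - ... - a_1 z + 1.
    coeffs = [-c for c in reversed(a)] + [1.0]
    return bool(np.all(np.abs(np.roots(coeffs)) > 1.0))

print(ar_part_is_stationary([0.5, -0.25]))  # True: roots 1 +/- i*sqrt(3), modulus 2
print(ar_part_is_stationary([1.0]))         # False: unit root (random walk)
```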
But in the interest of time I'm going to skip the proof; we're going to post the slides online, so you can check it for yourself if you want a reference.

Let us go back to what we started talking about originally: we were interested in modeling time series that are non-stationary, because they're much more realistic for the applications we want to tackle. But we have just said that these classical ARMA models are stationary. Are we doomed? Can we still use this generative framework to model non-stationary time series? It turns out that the lag operator and the characteristic polynomial ideas also allow you to build models of non-stationary time series. In particular, one way to do that is to use ARMA processes whose characteristic polynomials do have unit roots, that is, roots lying on the unit circle. In that case you can factorize such a polynomial into a polynomial R(z), which has all its roots outside the unit circle, times the polynomial (1 - z)^d, which has a unit root of multiplicity d. An ARMA process with such a characteristic polynomial is non-stationary, and that's how the celebrated ARIMA models of order d are defined. The I here, by the way, stands for "integrated": you can think of an ARIMA model as an ARMA model for the process obtained by applying the polynomial (1 - L)^d to the process y_t. Take, for instance, the simple case d = 1: an ARIMA model of order d = 1 says that y_t - y_{t-1} is a stationary ARMA process.

So what do ARIMA processes look like? If you plot one, you get a picture like this, which is much better in terms of non-stationarity: we get at least some stochastic trend here. And you can keep playing this game: if you want to model other kinds of non-stationary phenomena explicitly, you can come up with all sorts of extensions of ARIMA models, and there is a whole zoo of those. You can incorporate a seasonal component, long memory, so-called fractional effects, time-varying coefficients, nonlinear models, and so on; there is really no limit.

One extension that any talk like this should mention is the one that incorporates non-stationarity in the second moment of the time series; in other words, it models a non-stationary variance. That's what's known as a GARCH model. This model simply says: I want a non-stationary variance, so I'm going to assume that the variance itself is a stochastic process, namely an ARMA process. That's how a GARCH model is defined. It was introduced by Engle in the 1980s, and that's what he got the Nobel Prize for. Here is an example of a GARCH process: you can see that the volatility changes over time for this series. And then you can continue coming up, as I said, with all sorts of fancier models.
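To make the d = 1 ARIMA case above concrete, here is a small added illustration: a random walk, the simplest ARIMA process with d = 1, is non-stationary, but its first difference recovers stationary white noise.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal(3000)
y = np.cumsum(eps)   # random walk: (1 - L) y_t = eps_t, non-stationary
dy = np.diff(y)      # first difference: recovers the stationary noise

# The variance of y depends heavily on which chunk you look at,
# while the variance of dy is close to 1 in both halves.
print(y[:1500].var(), y[1500:].var())
print(dy[:1500].var(), dy[1500:].var())
```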
Another example is the class of state-space models, which you can basically think of as a continuous-state analogue of hidden Markov models. The way they work is that you assume there is some unobserved state x_t that follows an autoregressive model: the next state x_{t+1} is b, some parameter of the model, times the previous state x_t, plus a noise term. The observations you actually get are assumed to be a linear function of the state, again with noise: y_t = a · x_t plus a noise term. So that's an example of a particular state-space process, and you can continue, as I said, playing these games and coming up with different generative models.

We've talked a lot about the first step in the classical framework, postulating a model. How do we perform the second step, estimation? How do we find the parameters of these autoregressive or state-space models? The first thing you would probably think of is maximum likelihood estimation: make a further assumption about the noise terms we introduced, for instance that they're Gaussian, write down the log-likelihood of the observed data, and optimize the parameters of the model to minimize the negative log-likelihood. For certain models there are also less general techniques, such as the method of moments or conditional and unconditional least-squares estimation, which you can apply, for instance, to some of the autoregressive models I mentioned.

But the question is: we're not in the i.i.d. scenario anymore, so do these methods actually work? Can we give any guarantees for them? People have worked on this and have derived learning guarantees for these methods under some additional assumptions. For instance, to give a guarantee for the ARMA model described earlier, I would impose an assumption of invertibility, which basically means that the other characteristic polynomial, the one on the noise term in the formulation of the ARMA model, must also have all its roots outside the unit circle; in that case the ARMA model is invertible. If I have both invertibility and stationarity for the ARMA model, then I can, for instance, guarantee that the least-squares estimator of the autoregressive parameters will converge in probability to the true parameters of the model. This is an asymptotic learning guarantee; it also requires the model to be correctly specified, but at least it gives some hope that these models can work. I'm talking here only about the least-squares estimator for ARMA models, but you can derive similar asymptotic guarantees for the other models and estimators I mentioned.
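For illustration, here is a sketch of the conditional least-squares estimator just mentioned, for a pure AR(p) model (an added example; consistency holds under the stationarity and correct-specification assumptions above):

```python
import numpy as np

def fit_ar_least_squares(y, p):
    """Estimate AR(p) coefficients by regressing y_t on its p past values."""
    # Row for time t holds (y_{t-1}, ..., y_{t-p}); column i is lag i+1.
    X = np.column_stack([y[p - 1 - i : len(y) - 1 - i] for i in range(p)])
    a_hat, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return a_hat

# On a long stationary AR(2) sample, a_hat should approach the true
# coefficients, e.g. (0.5, -0.25), as the asymptotic guarantee states.
```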
This more or less wraps up the lightning-fast introduction to classical time series analysis. I just want to emphasize a few things we talked about. First, there are many, many generative models, and you can come up with more of them yourself, modeling all these phenomena in whatever way you like. Second, the learning guarantees we discussed for these models are typically only asymptotic, and they require the model to be correctly specified, so they are not robust to model misspecification. Finally, one of the major difficulties in applying these models is that you need to come up with the model of the non-stationarity yourself: you need to look at the data and say, "this is the kind of non-stationary phenomenon I think is happening, and this is the model I'm going to impose." In the remainder of this tutorial we're going to discuss some new theory and new learning algorithms that try to lift these assumptions. We'll break now for two minutes, just to catch a breath, and then continue with the theory part of the tutorial.

[Host] Okay, we're going to continue with the tutorial. Our next speaker is Mehryar Mohri. He's a professor of computer science and mathematics at the Courant Institute, and he's also a research consultant at Google Research.

[Mehryar Mohri] All right, thank you very much. First of all, I wanted to say that I'm really impressed to have this many people here for a time series tutorial; we knew it was popular, and now here's the proof. This section, as promised, is about learning theory. When some people ask me why we need learning theory, the answer is: otherwise you're left to do parameter search in a blind fashion, and there is no principled way of designing an algorithm. Vitaly gave a description of some classical techniques in time series prediction. He also mentioned some insufficiencies of these techniques: in particular, they do not come with the standard guarantees we expect for finite training samples, their guarantees are asymptotic, and they come with a whole bunch of assumptions. The concepts and tools I'm going to discuss in the next few slides will hopefully be helpful for designing algorithms for time series in a more general way.

Let me start with the formal description of time series forecasting. In this scenario, the learner receives a finite training sample, that is, a finite realization of a stochastic process: (x_1, y_1), (x_2, y_2), ..., (x_T, y_T), where the x_t's are in some input space X and the y_t's are in the output (label) space Y. I will denote by Z the product space X × Y. We then assume a loss function L, bounded by 1, that takes as arguments a hypothesis h in some hypothesis set H of functions mapping X to Y, and an element of Z, that is, a pair (x, y). The learner's problem is to use the sample to come up with a hypothesis h in H that has a small generalization error, or, if you wish, path-dependent expected loss, defined at the bottom of the slide. This also gives me the opportunity to describe the notation: the calligraphic L(h, Z_1^T), where bold Z_1^T denotes the training sequence Z_1 through Z_T, is defined as the expectation, at time T+1, the time at which you wish to make a forecast, of the loss of the hypothesis on Z_{T+1} drawn from the distribution conditioned on what you have observed so far, that is, on Z_1^T.
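In symbols, the definition just described reads:

```latex
% Path-dependent expected loss of h at forecasting time T+1,
% conditioned on the observed realization Z_1, ..., Z_T:
\mathcal{L}(h, \mathbf{Z}_1^T)
  = \mathop{\mathbb{E}}_{Z_{T+1}}\big[\, L(h, Z_{T+1}) \,\big|\, \mathbf{Z}_1^T \,\big]
```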
Now, the standard assumptions made in the analysis of time series forecasting are stationarity and mixing. Stationarity was already mentioned by Vitaly: in short, it means that the distribution is invariant to shifting. Mixing is a different type of assumption, which essentially says that events A occurring after time n+k have a dependency on events B occurring before time n that decays with the gap k. This decaying dependency can be measured in many different ways; one way is the standard, so-called beta-mixing coefficient, and there are many others: alpha-mixing, phi-mixing; there's almost any Greek letter you could append "mixing" to. It so happens that beta-mixing is, one could argue, the right view. I'm not going to detail what beta-mixing technically consists of, because, as you'll see in a minute, I'm going to criticize all of these notions and make a lot of enemies, I'm sure.

Before doing so, let me just say that most people who have studied time series from the learning theory point of view, and that includes me and many of my colleagues, have adopted these assumptions of stationarity and mixing. That includes the work of Vidyasagar on PAC learning, where he actually argues that beta-mixing is the right view; the generalization bounds given by Yu in terms of the VC-dimension for binary classification; the covering-number bounds for regression given by Meir; the data-dependent Rademacher complexity bounds for general loss functions that we gave with Rostamizadeh, which give a more general view of all this; and, more recently, PAC-Bayesian bounds given by Alquier et al. All of these make the assumptions of stationarity and beta-mixing, and that includes even algorithm-dependent studies, such as the study of AdaBoost by Lozano et al., our study of stable algorithms in a very general way, and many other studies of this kind.

All this raises some fundamental questions, because stationarity and mixing often do not hold; in fact, they typically do not hold. Just think of a trend or a periodic signal. Furthermore, these assumptions are not testable, so how do we even know if they hold? It's even worse than this: even if you assume that mixing does hold, and, to make it simpler still, assume you even know what form the beta-mixing coefficients have, say an exponential form, estimating their parameters turns out to be a very difficult problem. But perhaps worst of all, these notions do not capture some fundamental parameters of the learning problem: the hypothesis set, which we know in machine learning cannot be the family of all measurable functions, otherwise we cannot learn, and the loss function. So this raises the questions: is learning with general non-stationary, non-mixing stochastic processes even possible, and can we design algorithms that come with theoretical guarantees for such broad and general scenarios? Again, remember that those standard assumptions typically do not hold. We need a new tool for this analysis.

To find one, let's start from first principles. Suppose you have a single hypothesis h, so fix h. What are we really interested in? What we care about is the expected loss, again that conditional expectation of the loss given the sample, at the forecasting time T+1; but what we receive is just training points. So the difference between the expected loss at time T+1 and the expected loss of the same hypothesis h at time t is really the key quantity; that is what we really care about. Of course, we are given a full sample from time 1 to T, so it's the average of this quantity over t that becomes the really important difference related to time series prediction.
That was for a single hypothesis. If you take the supremum over all hypotheses, this gives the definition of a quantity that we call the discrepancy. As you can see at the top of the slide, it's the supremum over all hypotheses of the average difference I mentioned before. Notice that this quantity, unlike notions purely related to the stochastic process, captures the loss function and depends directly on the hypothesis set. Perhaps even more importantly, this quantity can actually be estimated from data, as we will see, under some relatively mild assumptions. Notice that the discrepancy Δ is 0 in the i.i.d. case; that's straightforward to see. One can also show that it vanishes in some broader cases: for example, for weakly stationary processes with linear hypotheses and the squared loss, the discrepancy is also just 0.

This definition of discrepancy is based on a uniform average of those differences, but we might wish to use an even richer definition, still geared towards the analysis of this problem, which we call the weighted discrepancy. Here, instead of a uniform average of these differences, we use weights q_1, q_2, ..., q_T to define the averaging. These weights q_t are going to be crucial, particularly when it comes to algorithmic design. Why? Because you would wish to emphasize those points z_t in your sample whose expected loss is close to that at the time at which you're forecasting, and, on the contrary, de-emphasize the sample points whose expected loss is much further away from that at the time of prediction. The weights q_t will help us do this emphasizing and de-emphasizing.

Two words about this: the weighted discrepancy strictly generalizes previous definitions we have given in the context of drifting with Andres Munoz Medina, and for domain adaptation with Yishay Mansour and Afshin Rostamizadeh and with Corinna Cortes; it is also related to the notion of discrepancy, or distance between distributions, that Shai Ben-David and others have used in the past. We can also give upper bounds on this discrepancy in terms of some familiar quantities, such as the relative entropy, but if we were to do so, we would lose the interesting aspect of the discrepancy, namely that it takes into account the loss function and the hypothesis set. We can likewise give upper bounds in terms of other quantities, such as phi-mixing coefficients, for asymptotically stationary processes.
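Before moving on, here is a restatement in symbols of the two definitions above (a sketch following the description; L_t(h) is shorthand for the expected loss of h at time t conditioned on Z_1^{t-1}, and the weights q_t are assumed to sum to one):

```latex
% Discrepancy with uniform averaging, and its weighted generalization:
\Delta = \sup_{h \in H} \frac{1}{T} \sum_{t=1}^{T}
  \Big( \mathcal{L}_{T+1}(h) - \mathcal{L}_t(h) \Big),
\qquad
\Delta(\mathbf{q}) = \sup_{h \in H} \sum_{t=1}^{T} q_t
  \Big( \mathcal{L}_{T+1}(h) - \mathcal{L}_t(h) \Big)
```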
I mentioned that the discrepancy can be estimated, so let me discuss this a little on the slide: the discrepancy can be decomposed into two parts. Suppose we wish to make a prediction at time T+1. There is going to be some small interval of length s, just before the time of prediction, over which you don't expect the distribution to have changed radically. So you expect the blue arrow here, which indicates the discrepancy, if you wish, between the loss at time T+1 and the average loss within that interval of length s, not to be too large; you expect it to be relatively small. In fact, I'll go a bit further: you can prove that if this is not the case, learning is not possible. So we have to assume that there is some s for which this difference is not too large. Assuming that, which seems natural, we can upper bound the discrepancy by the sum of two terms. One is the term I just described, denoted Δ_s at the top of the slide, with s referring to the size of the interval. The other is the discrepancy between that interval of length s and whatever else you have in your training sample; that is the first term you see there, Δ_0(q). It turns out that Δ_0(q), the first term upper-bounding Δ(q) (notice that it's stated entirely in terms of the expected loss, the conditional expectation, the calligraphic L), can be accurately estimated from data. Not so surprisingly: what would you do to estimate it? You would consider the same term with the calligraphic L's replaced by the true losses observed on the sample, and it turns out that this is a very good estimate of the quantity.

So now, if you use this notion of weighted discrepancy, which again essentially comes from first principles, and you apply some general tools that I will briefly describe, you obtain the following very general learning guarantee. The theorem says that with high probability, for all hypotheses h in H and for all α (think of α as a small number, say 1/T or 1/√T, where T is the size of your sample), the left-hand side, the path-dependent expected loss of hypothesis h, the quantity you wish to minimize, is upper bounded by: the first term, the weighted training loss with the weights q_t that you might wish to play with algorithmically; plus a term that is exactly the weighted discrepancy; plus 2α, which, as I said, you can essentially disregard, it being a small number; plus the norm of the weight vector (q_1, ..., q_T), and note that with uniform weights that 2-norm of q is just 1/√T, times the square root of 2 times the log of a quantity that is the expectation of a covering number.

I think I've scared you enough now; actually, raise your hand if you're not scared yet. I managed to scare a few. This is the expected covering number; for now, just think of it as a measure of the complexity of the family G of losses of hypotheses h, that is, as indicated at the bottom of the slide, the family of functions mapping z to L(h, z). I will say a few words about the exact definition of this sequential covering number, or expected sequential covering number, a bit later, but for now note that this type of learning guarantee is actually not very different from what we have seen up to now. What does it say? In short, it simply says that the generalization error is bounded by the empirical error plus a complexity term, all of which are familiar; the only really new term is the weighted discrepancy Δ(q).
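Schematically, the guarantee reads as follows (a reconstruction from the description above; the exact form and constants are in the slides):

```latex
% For all h in H and all alpha > 0, with probability at least 1 - delta:
\mathcal{L}(h, \mathbf{Z}_1^T) \le \sum_{t=1}^{T} q_t\, L(h, Z_t)
  + \Delta(\mathbf{q}) + 2\alpha
  + \|\mathbf{q}\|_2 \sqrt{2 \log \frac{\mathbb{E}\big[\mathcal{N}_1(\alpha, G, \mathbf{z})\big]}{\delta}}
```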
It turns out that Δ(q) is really crucial: for some scenarios, say drifting, for example, you can show that you cannot do away with that discrepancy; it's essentially the best measure you can come up with here. So the discrepancy really has to be there. Furthermore, and Vitaly will get into the details of this, this sort of learning guarantee can directly guide the design of algorithms: the choice of the hypothesis h and the choice of the weights q_1, ..., q_T, which are really crucial here, can be made precisely by following this bound, if you wish to minimize it. Again, Vitaly will say a lot more about this.

Now, there is still a problem, because Δ(q) is the true discrepancy, and we don't have the true discrepancy. The good news is that we can estimate it, so we just have to work with the estimate. There is another learning bound in terms of the estimated discrepancy: that's the Δ-hat of q, the estimate of the Δ_0(q) introduced previously. Δ-hat is precisely the same expression as Δ_0(q) with the calligraphic L replaced by the loss L; it's precisely its empirical version. This learning bound, which differs from the previous one only by the presence of the empirical discrepancy instead of the true discrepancy, is a quantity you can now directly work with, both for algorithms and for guarantees. It is probably the first learning guarantee given for general stochastic processes: there is no assumption of stationarity or mixing. The critical quantity is the discrepancy, or the weighted discrepancy, and I've already said what I think about the importance of those weights.

If you have survived so far, you've already had the main takeaways from this theory section. But now I'm going to say a few more words, only for those who can still survive, about the definition of the weighted sequential covering number, and tell you that in the case of stochastic processes we cannot just use standard covering numbers; we need a sequential version. A weighted sequential cover of a tree is defined as follows. Suppose you have a full tree of depth T, such as the one displayed on the slide; let's call it bold z. Every node of this full tree is labeled with some element of Z: z_1 for the root, then z_2, z_3, and so on. An α-cover of that tree is a set of trees V such that, no matter what path you choose in the tree, for example the red path, there is a tree v in V such that, along this path, the vector given by v and the vector of values of g along the path are very close to each other: the l_1 norm of the difference of these two vectors is bounded by α divided by the norm of q. That's what's indicated in the bottom corner of the slide for the particular red path. When this holds in general, that is, for any labeling of the tree with z's there exists a v in V close to the values of g along every path, then the family of trees V provides a sequential α-cover of the tree z. And what we want is not just any family V of trees: we want the family of smallest cardinality.
The smallest cardinality of a family V that covers z is the sequential covering number of that particular tree z. But really, what we care about is not a particular tree z; it is the expected covering number, when we take an expectation over trees z. To describe that expectation, look at the tree shown on the slide. You can play the following game: at time 1, that is, at the root, you draw two samples Z_1 and Z'_1 according to the distribution. Then, if you go right, you draw Z_3 and Z'_3, again independently, but conditioned on the primed variable Z'_1; if you go left, you draw Z_2 and Z'_2, independently but conditioned on Z_1; and so forth. The expectation of the covering number that I talked about is the expectation over trees z drawn according to the stochastic process just described.

All of that looks a little complicated, and that's no surprise: we're dealing with a much more complex situation in time series prediction. We're far from the easy i.i.d. case, where we can just blindly apply a whole bunch of learning theory tools; now we have to do a little more work.

All right, so if I still have some time, and I'm not sure how much I do, I don't see anybody jumping on me, I'm just going to say a few words. I have two minutes? And within those two minutes I have fifteen minutes for the theory section? Oh, what, okay. So I went from fifteen to two; you see, this is how negotiations go; you can see I should not negotiate anything. So I'm back to two minutes, and perhaps that's what you would prefer anyway; you've been suffering quite a bit. Let me just say that with Vitaly we've decided that you'll have a longer break here, and you really deserve it after this; hopefully you got the big picture. So I'm just going to give you the high-level ideas of the proof of the learning bound that I just threw at you.

The proof starts in the standard way, by applying the Chernoff bounding technique, which consists of introducing a positive scalar t that you optimize near the end of the proof to get the best bound, and applying Markov's inequality; this part is really not so difficult. Next, you have to come up with a symmetrization, and here again I'm just going to give the high-level ideas. The deep techniques behind this relate to the notion of decoupled tangent sequences. For any sequence Z_1, ..., Z_T there exists an associated tangent sequence Z'_1, ..., Z'_T. What does that mean? It simply means the following: the sequence is such that Z'_t and Z_t are independent given the past Z_1, ..., Z_{t-1}. So imagine that you draw Z' in this fashion: whenever you have reached time t, you draw Z_t, conditioned of course on what you have seen so far, and you also draw Z'_t, conditioned on the same past but in an independent fashion. It's clear that you can always construct a decoupled tangent sequence this way. Coming up with a Z' that decouples, with this independence aspect to it, is crucial to the proof and the analysis. That's what's going on on this slide.
The next slide uses another property of tangent sequences, for the introduction of Rademacher variables, and in the last slide we make use of that somewhat complicated expected sequential covering number I mentioned, by applying a standard union bound. So I just gave you a very high-level idea, but it's essentially what you would really need, and, as Vitaly mentioned, you will be able to look at the slides and check the details to be convinced that the proof is indeed correct. I'm going to stop here, and I wish you a long break; when you come back, you will hear how this theory can be used effectively to come up with algorithms that have strong guarantees. Vitaly will describe that. Thank you. [Applause]

[Vitaly Kuznetsov] Okay, let's get settled in and continue, and talk a little about the algorithms one can develop based on the theory we discussed before the break. I hope you're not all too scared by it, but the bad news is that I'll have to show you a bit of theory again. Don't be too scared, though: this is actually just a review, so you should all know what it is already. It's exactly the same bound you saw before the break. Just to remind you what it says: with high probability, uniformly over the hypothesis set H, the generalization error of any hypothesis h in H is bounded by the q-weighted empirical error of that hypothesis on the sample, the empirical discrepancy, the discrepancy term Δ_s that we hope to be small, and a term capturing the complexity of the set of hypotheses we are using.

I would like to make two important points about this bound. First, we can extend it to hold uniformly over the weights q in some bounded set. Second, as we emphasized in the theory section, these guarantees are data-dependent. These two points are important because they directly motivate, and help you derive, algorithmic solutions to the problem of forecasting non-stationary time series. The key idea is to take this bound, which you can compute on the data for each hypothesis, and seek a hypothesis h in H and weights q over the sample that directly minimize the learning bound on the previous slide. This is the general idea of the algorithm, and in the rest of this presentation I'll argue that for certain very broad families of hypothesis sets and loss functions, you can solve this kind of optimization problem efficiently.

The first set of hypotheses you may think about, and the first you may be interested in, is linear hypotheses. We'll consider linear hypotheses with the squared loss. To remind you, the squared loss is L(y, y') = (y - y')², which is very natural for forecasting problems. Linear hypotheses, or more generally kernel-based hypotheses associated with a PDS kernel K, are defined as follows: we take the feature map Φ_K induced by the PDS kernel (for the linear kernel, this is simply the identity map), we map our observations x via Φ_K into a Hilbert space, and our hypothesis is simply a linear function in that space.
That is, the hypothesis takes the inner product of some weight vector w with the representation of the observation in the Hilbert space, and we control the complexity of these linear hypotheses by restricting the norm of the weight vector by some parameter Λ. This is a very standard hypothesis set, used everywhere in machine learning.

For this setting, you can actually upper bound the very scary sequential covering number term that appears in the learning guarantee in terms of the parameters of the hypothesis set and the data. In particular, the bound I give on this slide shows that the complexity term is bounded in terms of the parameter Λ controlling the norm of the linear hypotheses and the radius of the PDS kernel on the data, the supremum of K(x, x) over the points x in the sample.

We're also going to make another simplification: instead of considering the empirical discrepancy directly, we will look at what we call instantaneous discrepancies. You can upper bound the empirical discrepancy that appears in our learning guarantees by the sum Σ_{t=1}^{T} q_t d_t, plus a term that captures the deviation of the weights q from uniform weights in l_1 norm. The d_t's in this bound are what we call instantaneous discrepancies: d_t is exactly the discrepancy between the most recent s observations in our sample and the observation at time t. Again, I won't go through the proof of this result here, but you're welcome to look at it in your free time; it's a very simple result that just uses the subadditivity of suprema.

What is important to discuss here are the computational aspects involved in estimating the discrepancies. Can we actually compute these quantities? We said that statistically we can estimate them, but that doesn't mean they're computable. If we look at the definition of the instantaneous discrepancy d_t, at least in the case of kernel-based hypotheses with the squared loss, we arrive at the following, perhaps also scary-looking, expression. Let's take it apart. What it tells us is that to find an instantaneous discrepancy, we need to solve a certain optimization problem, and its objective consists of two terms; in fact, it is a difference of two convex functions. This means the problem we need to solve is a DC-programming problem, and we can use the existing machinery for DC-programming problems to find the discrepancy d_t. Even more importantly, in this special case of the squared loss, one can use special DC algorithms, difference-of-convex-functions algorithms, that allow you to recover a globally optimal solution to this non-convex optimization problem. This matters because it allows us to compute, efficiently and accurately, the discrepancies that will be critical for our algorithms, as you will see.

Okay, let us put all of this together. For the special case of kernel hypotheses with the squared loss, we have a learning guarantee which is almost the same as the one we've seen before.
The complexity term is replaced by the upper bound discussed earlier, and there is an additional term in the bound that comes from the fact that the bound now holds uniformly over all weights q. This bound can be computed efficiently on the data, because it is a data-dependent learning guarantee and all its terms are computable. So there is a corresponding optimization problem to be solved in order to find the hypothesis to use for forecasting. The objective of this optimization problem consists of the following four terms, which come directly from the upper bound. The first term is the q-weighted empirical squared loss, which is very natural. The second term is the q-weighted sum of the instantaneous discrepancies, which we know how to compute. The third term is a regularization term on the linear hypothesis, which comes directly from the complexity term in the bound. The last term controls the deviation of the vector q from the uniform distribution, and it serves as regularization for this parameter that we are optimizing. Here λ1, λ2 and λ3 are hyperparameters that you set via cross-validation.

Note that this optimization problem is joint over both q and w, and at this moment you may be thinking: wait a second, is this optimization problem easy to solve? Is it convex? That's a very valid concern: in fact, this formulation is not convex. But you can alter the optimization a little to arrive at a convex formulation. In particular, you can change variables and set r_t = 1/q_t; if you then further upper bound the regularization term for the q's that appeared in the previous objective, you arrive at a new formulation which is almost exactly the same as before. We still have the first term, a weighted empirical error, but now the weights are 1/r_t, and this, I remind you, is a convex function of w and r, because it is a quadratic-over-linear function. The second term is the weighted instantaneous discrepancy term, with weights 1/r_t again, which is convex for the same reason. The last two terms are regularization terms, an l_2 norm and an l_1 norm, and they are also convex. So overall we get a convex formulation of our problem, which is nice.

Let me also mention that there is another way to obtain a simpler algorithm, which does not learn the q's and w's together but in a two-stage procedure, and which is also convex. The way this algorithm operates is that you first minimize the empirical discrepancy on the data; this gives you a weight vector q, and this can also be shown to be a convex problem. Once you have done that, you take this vector q and solve a q-weighted version of the kernel ridge regression problem, which is again a standard convex problem.
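As a sketch of the second stage (added for illustration; the discrepancy-minimization first stage that produces q is assumed done, and the closed form below is the standard dual solution of weighted kernel ridge regression):

```python
import numpy as np

def weighted_kernel_ridge(K, y, q, lam):
    """Solve min_w sum_t q_t (<w, Phi(x_t)> - y_t)^2 + lam ||w||^2
    in the dual: alpha = (K + lam * Q^{-1})^{-1} y, with Q = diag(q).
    Assumes the Gram matrix K is precomputed and all q_t > 0.
    The forecast at a new point x is sum_t alpha_t K(x_t, x)."""
    alpha = np.linalg.solve(K + lam * np.diag(1.0 / q), y)
    return alpha
```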
These are the algorithmic ideas you can use for forecasting non-stationary time series. We implemented some of these and played around with them, both on synthetic data and on real data sets, and I will now describe some of the experiments. Let's start with simple synthetic examples, just to get some intuition about how these things work in practice. On this slide I plot a very simple stochastic process, which is basically an AR(1) autoregression with a slight twist: for the first thousand observations I set the autoregressive coefficient of the model to 0.9; for the next thousand observations I switch the coefficient to -0.9; and at the end, for the last thousand observations, I switch it back to 0.9. In this way I introduce a non-trivial non-stationarity into the stochastic process, and it's interesting to see how the estimation of the discrepancies behaves in this kind of scenario and how our algorithms work.

We can look at plots of the discrepancies estimated for this process, shown on the left of the slide: in red you see the estimated discrepancies, and in green the true discrepancies. The estimation is actually rather noisy, but at least it gets the shape right: in the region between t = 1000 and t = 2000, where I made the switch, the average discrepancy is much higher than in the other regions. What's even more interesting is that the algorithms are quite robust to these errors in the estimation of the discrepancies. If we look at the weights q learned by the algorithms, plotted on the right of the slide (in red the weights learned using estimated discrepancies, in green those learned using the true discrepancies), we see that in both cases the algorithm completely avoids the region between t = 1000 and t = 2000. This is exactly what we would desire, because in that region the distribution of the process is completely irrelevant to the most recent observations, which are the ones useful for forecasting the next point.

And just to wrap this section up, here are some further error plots for this artificial synthetic data. What we see here is running mean squared error. We train our three algorithms, ARIMA, discrepancy-based forecasting with true discrepancies, and discrepancy-based forecasting with estimated discrepancies, on the first 500 points, then forecast the next 25 points, I believe; we record the error, add these 25 points to the training set, retrain, forecast the next 25 points, and proceed in this online fashion, producing a running plot of mean squared errors. As you can see, before we encounter any non-stationarity in the data, the algorithms behave about the same; but once you hit the non-stationarity, as you would expect, the error of the ARIMA model, which is not designed to handle this kind of non-stationarity, blows up, while our algorithms remain robust to this non-stationary change in the distribution.
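For reference, here is a sketch of the synthetic switching AR(1) process used in these experiments (reconstructed from the description above):

```python
import numpy as np

def switching_ar1(n=1000, seed=0):
    """AR(1) whose coefficient is 0.9 for the first n points,
    -0.9 for the next n, and 0.9 again for the last n."""
    rng = np.random.default_rng(seed)
    coeff = np.concatenate([np.full(n, 0.9), np.full(n, -0.9), np.full(n, 0.9)])
    y = np.zeros(3 * n)
    for t in range(1, 3 * n):
        y[t] = coeff[t] * y[t - 1] + rng.standard_normal()
    return y
```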
In addition to this, we did experiments with some real-world data sets that we found on the Internet: commodity prices for different commodities, exchange rates, and climate and temperature forecasting. As you can see from these results, we compared ourselves against the very standard baseline people compare against first, ARIMA models; we never performed worse than ARIMA, and on, I think, five out of eight data sets here we do better. So this was the algorithmic section; next we're going to talk about the connection between time series prediction and the new hot area that people are very interested in these days, online learning.

[Mehryar Mohri] Okay, so we switch gears: after the theoretical section and then the algorithms and the experiments, we now talk about a connection with online learning, which, as Vitaly said, is probably one of the most important and fast-growing areas of research in machine learning. Let me start by saying that there are really two learning scenarios for tackling time series. The first is probably the one we favored in our presentation up to now: the stochastic scenario, which assumes that there is a distribution underlying the data; there, the performance measure is the expected loss, for us the path-dependent expected loss, and the learning guarantees are generalization bounds such as those I presented previously. The online learning scenario is in a way a much broader scenario: no distributional assumption is made, the setup is adversarial, the performance measure is based on a notion of regret, and the guarantees are bounds on that regret. This is a very, very active research area; we've listed here just some papers and names, clearly not covering even a small part of online learning. You should definitely consult the survey book of Cesa-Bianchi and Lugosi if you're interested in this area; there is a very rich literature. The papers we're favoring here are some by Bousquet and Warmuth, or Herbster and Warmuth; one reason is that there is a rich family of online learning algorithms, for tracking, for example, that could be very relevant for time series.

Since I'm not sure how familiar the audience is with online learning, I'm going to give you probably the shortest tutorial on online learning that is possible, a four-slide tutorial, and then we'll see how we could use it to connect with time series and solve problems in time series that we don't really know how to tackle otherwise.

Let me start with the setup. Online learning is an adversarial setting with a hypothesis set (or action set) H. The game goes over T rounds. At each round t, the player (the learner) receives a point x_t from X, selects in response a hypothesis h_t from H, the adversary selects a label y_t, and the player suffers or incurs the loss L(h_t(x_t), y_t), the divergence of the prediction with respect to the true label y_t. The goal of the learner is to minimize his regret: the cumulative loss, the first term in the definition of the regret indicated at the bottom of the slide, that is, the sum of the losses suffered over T rounds, minus the best cumulative loss in hindsight, looking back at the data, suffered by the best hypothesis h taken from a reference family H*, which need not coincide with H. That's the notion of regret, which says how much worse you do relative to that reference.
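In symbols, the regret after T rounds is:

```latex
% Cumulative loss of the player minus the best cumulative loss
% in hindsight over the reference family H*:
R_T = \sum_{t=1}^{T} L\big(h_t(x_t), y_t\big)
    - \min_{h \in H^{*}} \sum_{t=1}^{T} L\big(h(x_t), y_t\big)
```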
That is the notion of regret: it tells you how much worse you do with respect to that reference. Here is an example of an online learning algorithm, a very standard one, based on exponential weights; people also refer to it as the exponentially weighted average algorithm. It addresses the case where you seek to learn with expert advice. Let H* be a family of N experts e_1, e_2, ..., e_N, and denote by H the convex hull of H*, that is, the convex combinations of those expert functions. The algorithm is quite simple: it maintains weights w_1, ..., w_N for the experts, indexed also by time; at the beginning, the weights w_{1,i} for each expert i are all equal to one. Then at each round the learner receives x_t, predicts with the average of the experts' predictions, where the averaging is done using these weights; he receives the true label y_t and incurs the loss of this averaging function h_t; and he updates the weights in an exponential fashion, which consists of multiplying the weight of expert i at time t by exp(-eta times the loss suffered by that expert). So if that loss is large, the weight is significantly decreased, for example.

A very simple algorithm, but even so it leads to a very strong guarantee, which is the following. If the loss function L is convex in its first argument, which is the case in a variety of situations in practice, say the squared error and other loss functions of that kind, and if it is a bounded loss, taking values in [0, 1], then for any sequence y_1, ..., y_T (and this is very strong, because no stochastic assumption is made about that sequence) the regret of this algorithm at time T is upper bounded by log(N)/eta + eta*T. If you choose eta, this learning rate that appears in the exponential, in the best way, the regret is upper bounded by roughly the square root of T log N. To understand what that means: if you choose a relatively large set of experts, a large N, then you have a better opportunity of finding a best expert in hindsight with small loss; on the other hand, you pay for this in the log N term of the upper bound, but that price is only logarithmic, so it is very favorable. And the quantity you are really interested in is perhaps not even the regret but the regret per round, that is, the regret divided by T, the average regret; you see at the bottom of the slide that the average regret decreases essentially as one over the square root of T. These are the types of guarantees you would see for a variety of algorithms in online learning: a very general setup, very general guarantees.
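Here is a minimal sketch of this exponentially weighted average forecaster. The class name, the expert interface, and the choice of the squared loss are illustrative assumptions, not taken from the slides:

```python
import numpy as np

class ExponentialWeights:
    """Exponentially weighted average forecaster (learning with expert advice)."""

    def __init__(self, n_experts, eta):
        self.w = np.ones(n_experts)   # w_{1,i} = 1 for every expert i
        self.eta = eta                # learning rate from the regret bound

    def predict(self, expert_preds):
        # Predict with the weighted average of the experts' predictions
        # (a point in the convex hull of the expert family).
        p = self.w / self.w.sum()
        return float(p @ np.asarray(expert_preds))

    def update(self, expert_preds, y_true):
        # Multiplicative update: w_{t+1,i} = w_{t,i} * exp(-eta * loss_i).
        # The squared loss is convex in its first argument and bounded in
        # [0, 1] when predictions and labels lie in [0, 1].
        losses = (np.asarray(expert_preds) - y_true) ** 2
        self.w *= np.exp(-self.eta * losses)

# With T rounds and N experts, the bound in the talk suggests taking eta on
# the order of sqrt(log(N) / T), giving regret O(sqrt(T log N)).
T, N = 1000, 10
learner = ExponentialWeights(N, eta=np.sqrt(np.log(N) / T))
```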
The proofs of these sorts of guarantees go as follows. You define a potential; here the natural potential function is the log of the sum of the weights. Then essentially you prove an upper bound and a lower bound for this potential function, you compare the two, and that gives you the regret bound. To derive the upper bound, you start by analyzing the difference of the potential at time t and at time t - 1; it is the log of a ratio of two weight sums, where the denominator involves the previous weights and the numerator the new, exponentially updated weights. You can relate that to an expectation with respect to the normalized weights; applying Hoeffding's inequality then gives you the type of upper bound at the bottom of the slide. You sum up these differences, and telescoping immediately gives an upper bound on Phi_T - Phi_0. You can trivially come up with a lower bound by noting that the sum can be lower bounded by a max; those are the two lines in the middle of the slide. Comparing these quantities immediately gives you the bound.
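For concreteness, here is a hedged reconstruction of that argument in symbols; the notation (Phi_t for the potential, ell_{t,i} for the loss of expert i at round t, p_{t-1} for the normalized weights) and the eta^2/8 constant from Hoeffding's inequality are my rendering, not a verbatim slide:

```latex
\Phi_t = \log \sum_{i=1}^{N} w_{t,i}, \qquad
\Phi_t - \Phi_{t-1}
  = \log \mathop{\mathbb{E}}_{i \sim p_{t-1}}\!\big[ e^{-\eta\, \ell_{t,i}} \big]
  \;\le\; -\eta \mathop{\mathbb{E}}_{i \sim p_{t-1}}[\ell_{t,i}] + \frac{\eta^2}{8}
  \quad \text{(Hoeffding)},
\\[4pt]
\Phi_T - \Phi_0 \;\ge\; \log \max_i w_{T,i} - \log N
  \;=\; -\eta \min_i \sum_{t=1}^{T} \ell_{t,i} \;-\; \log N .
```

Summing the first inequality over t telescopes to an upper bound on Phi_T - Phi_0; comparing it with the lower bound yields a regret bound of the form log(N)/eta + eta*T/8, matching (up to the constant) the guarantee stated earlier, and convexity of the loss transfers it to the learner's averaged predictions.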
So this was a very quick tour through online learning, probably the shortest tutorial on online learning you have ever heard or will ever hear. It raises a whole set of questions, though: can we exploit both the batch (stochastic) scenario and the online learning scenario to come up with flexible algorithms for tackling time series prediction? One reason to hope so is that the online learning setup, as I indicated, is in fact very rich. I mentioned measuring the regret with respect to the best expert in hindsight; just to say a few words about this, you can extend it to the case where you compare yourself not to the single best expert in hindsight but to a best expert that varies over time, the shifting-best-expert case, and algorithms for these sorts of scenarios have already been analyzed. There is a rich family of algorithms you can pick from; they are all simple to implement, based on simple updates such as the exponential update I indicated before, so it is an easy toolbox to use. Can we use these algorithms to come up with good solutions in the stochastic case for time series? Another crucial question that comes up in this scenario is: can we tackle some really difficult time series problems that up to now we have, in a way, been sweeping under the carpet, for example model selection in time series, or learning ensembles in time series prediction, which is again a crucial problem?

So let me say a few words about model selection and learning ensembles, and then see how we can gradually tackle these things. Model selection is the general problem we have throughout machine learning: given N models, here time series models, how can we use the sample Z_1, ..., Z_T to find the best model, to choose among these models? In the i.i.d. case this is a familiar problem: we have theory about model selection, such as structural risk minimization, and we know those guarantees can be approximated using cross-validation techniques, reserving some part of the data for validation and using it to pick the best parameter, for example. But how do we do validation, that is, come up with a validation set, for general stochastic processes in time series? How should the validation set be chosen? The last few points? If we do so, we discard everything we have seen before, which could have been really relevant. If we choose the first few points, we might be losing even more critical information, because that is precisely the information close to the prediction time. What else should we do: split the data in various different ways and choose those as validation sets? It is just really not clear what we should be doing. And by the way, if you hear people talking about time series and saying they do model selection by some form of cross-validation, ask them what exactly they do, because I bet you there is no real principle behind it. Notice also that the problem can be complicated by the fact that these models have probably been pre-trained on that same training sample.

There is another interesting problem that comes up if you wish to do accurate time series prediction in any kind of context, whether that is for earthquakes, as I heard from some people here, for predicting the battery level of your system, or for making predictions on Wall Street: imagine that you already have a family of good time series forecasters; can you use them to come up with a perhaps better and more accurate solution by combining them, by taking a convex combination of them, a sum of q_t h_t, where the h_t's are your forecasters? (A small sketch of such a combination appears right after this discussion.) Again this is not a trivial problem, because these forecasters, the h_t's, may have been pre-trained; they could be the result of training on that same sample. It turns out that both this model selection problem and the problem of learning ensembles can be tackled and addressed by coming up with a solution to the so-called problem of online-to-batch conversion. Online-to-batch is a familiar term and problem for those of you who know online learning, but here it is a much more difficult problem, because we are dealing with non-stationary, non-mixing processes that are quite different from the i.i.d. case.

Let me first define what an online-to-batch problem consists of and how it can be tackled. But before doing so, and to make sure I do not run out of time before saying it: there is going to be a time series workshop here at NIPS, organized by Vitaly and several other people, and I invite you to attend it; I will be describing in much more detail some of the solutions here, and the connections between online learning and the stochastic case, in that workshop. Coming back to the online-to-batch problem, it is a very natural problem. Denote by bold h the sequence of hypotheses h_1, h_2, ..., h_T that an online learning algorithm has returned over the course of T rounds; that is the input to the problem. The problem is to use those hypotheses, which the online learning algorithm has produced, to come up with a hypothesis h in H that has a small conditional, path-dependent expected loss. This is not a matter of regret anymore. In the i.i.d. setup this problem is well known and has good solutions, but how do we design solutions for this general time series scenario?
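As a toy illustration of the convex combinations just discussed (both the ensembles sum_t q_t h_t and the online-to-batch averaging that follows), here is a hypothetical sketch; the forecaster interface and the uniform weights are assumptions made for the example:

```python
import numpy as np

# Hypothetical pretrained forecasters h_t: each maps an input x to a prediction.
forecasters = [
    lambda x: 0.5 * x,        # e.g. a damped persistence model
    lambda x: x + 1.0,        # e.g. a drift model
    lambda x: np.tanh(x),     # e.g. a saturating model
]

def convex_combination(hypotheses, q, x):
    """Predict with h(x) = sum_t q_t * h_t(x), where q is a convex weight vector."""
    q = np.asarray(q, dtype=float)
    assert np.all(q >= 0) and np.isclose(q.sum(), 1.0), "q must be convex weights"
    return float(sum(q_t * h(x) for q_t, h in zip(q, hypotheses)))

# Uniform weights are one simple choice; the talk's point is that the q_t can
# (and should) be chosen to emphasize hypotheses most relevant near time T+1.
q = np.ones(len(forecasters)) / len(forecasters)
print(convex_combination(forecasters, q, x=0.3))
```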
One question is whether online-to-batch is even possible with general non-stationary, non-mixing stochastic processes: can we design algorithms with guarantees? Here again we need a new tool, and that tool is, not surprisingly, similar to what we discussed before: it is a discrepancy tool, but with a new twist, because here we are not dealing with arbitrary hypotheses; we are dealing with the hypotheses output by the online learning algorithm. If we look at the same key difference as before, that key difference is now taken for the particular h_t output by the online learning algorithm: when we take the average, we are not averaging over a fixed h_t but over the h_t's that the online learning algorithm has produced at each time step t; that is what the quantity at the bottom of the slide indicates, the average weighted difference, where the h_t varies with time. That leads to a notion of discrepancy, again derived from first principles, similar to what we saw before: it is a supremum, now not over the choice of a single hypothesis but over the bold h, that is, over the sequence h_1, h_2, ..., h_T, of this weighted difference. Again the weights q_t are crucial here: as before, they serve to exploit the fact that some hypotheses h_t at time t are much more useful for my prediction at time T + 1 than others, and the weights help me emphasize or de-emphasize them.

This is a very natural measure of non-stationarity or dependency. As in the stochastic case, it captures the hypothesis set and the loss function, and as in the stochastic case it can be estimated under some mild assumptions, pretty much exactly as I indicated before. But there is another, quite interesting way of estimating this online discrepancy, very close to the online learning view, so let me briefly touch on it; this discrepancy definition is a generalization of the type of discrepancy I indicated before. Suppose the loss function is mu-Lipschitz, and suppose there actually exists a hypothesis h* in H that has a small expected loss at time T + 1, the conditional expectation; denote that expression by eta (a different eta from the learning rate earlier; and disregard the little q in the slide's expression, it should not be there, it is just eta). So we assume there is a very good solution at time T + 1, meaning that this best expected loss eta is small. If that is the case, then you can show that the discrepancy, the left-hand side of this bound, can be upper bounded by the empirical (hat) discrepancy, plus mu times eta, plus a small term alpha as I indicated before, plus a complexity term. So the discrepancy can definitely be estimated here, essentially by using the fact that there exists a good hypothesis with small error; it exists, but I do not need to know it.
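In symbols, here is a hedged reconstruction of the online discrepancy just described and of the estimation bound, with Z_1^t denoting the history Z_1, ..., Z_t, bold h = (h_1, ..., h_T) the sequence produced by the online algorithm, and eta-bar the small best expected loss at time T + 1; the exact form of the complexity term is not specified in the talk:

```latex
\operatorname{disc}(\mathbf{q})
  \;=\; \sup_{\mathbf{h} = (h_1,\dots,h_T)} \;\sum_{t=1}^{T} q_t
  \Big( \mathbb{E}\big[ L(h_t, Z_{T+1}) \,\big|\, Z_1^{T} \big]
      - \mathbb{E}\big[ L(h_t, Z_t) \,\big|\, Z_1^{t-1} \big] \Big),
\\[4pt]
\operatorname{disc}(\mathbf{q})
  \;\le\; \widehat{\operatorname{disc}}(\mathbf{q})
  \;+\; \mu\,\bar{\eta} \;+\; \alpha \;+\; (\text{complexity term}).
```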
Showing this turns out to be quite simple: you start with the definition of the discrepancy, you break it into two parts to make the empirical discrepancy term appear, you use the subadditivity of the supremum to split it in two, you use the mu-Lipschitzness of the loss, and then the fact that the q_t's sum to one.

Now, it turns out that the online-to-batch problem admits a very nice solution here as well. If the loss function L is convex and upper bounded by M, then simply choosing the hypothesis that is the weighted average of the hypotheses returned by the online learning algorithm, h = sum_t q_t h_t, admits a very good guarantee in the stochastic sense: with high probability, its path-dependent expected loss is upper bounded by a weighted average of the training sample losses, sum_t q_t L(h_t(x_t), y_t), plus the online discrepancy, plus a term that depends on the bound M on the loss and on the norm of the weight vector, which you can think of as being essentially close to one over the square root of T. This is a strong learning guarantee for the online-to-batch setting, because it tells you that even in this scenario it is possible to combine hypotheses to come up with a good, accurate one. I am not going to go through the proof; it is really just two or three lines.
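A hedged reconstruction of that first guarantee, for h = sum_t q_t h_t with a convex loss bounded by M; the exact constants follow the usual high-probability form and are an assumption on my part:

```latex
\mathbb{E}\big[ L(h(X_{T+1}), Y_{T+1}) \,\big|\, Z_1^{T} \big]
  \;\le\; \sum_{t=1}^{T} q_t\, L\big(h_t(x_t),\, y_t\big)
  \;+\; \operatorname{disc}(\mathbf{q})
  \;+\; M \,\lVert \mathbf{q} \rVert_2 \sqrt{2 \log \tfrac{1}{\delta}}
  \qquad \text{with probability} \ge 1 - \delta .
```

For uniform weights, the norm of q is 1/sqrt(T), matching the "essentially one over square root of T" remark.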
But what is even stronger is the following guarantee, again for a convex loss function: for the same hypothesis h, the path-dependent expected loss can be upper bounded by the loss of the best sequence h*, plus the discrepancy, plus a term that is the regret per round, which, as we showed, can be upper bounded in many online learning cases by a term of the form one over square root of T, plus the other terms you already saw appearing in the analysis Vitaly presented, in terms of the weights used in the stochastic case. So the high-level idea is that online learning can be used to produce a whole sequence of experts h_1, ..., h_T; those experts can be combined, using some weights q_1, ..., q_T, to come up with a very strong hypothesis in the stochastic case, with the guarantee just given; and, same story as before, we can choose those weights q_1, ..., q_T in the best way, to minimize that upper bound. These are things I will be able to describe in more detail if you attend the talk I will be giving in the workshop.

I am going to stop here and conclude with the following remarks about time series. Time series forecasting is a key learning problem in a variety of important tasks; Vitaly mentioned many of them at the beginning. Most of the data we face today is sequential in nature, so time series is the natural way of formulating these problems. At the same time, time series prediction in the most general case is a challenging problem, both from the theoretical point of view and from the algorithmic point of view, if you want to come up with good, accurate solutions for which you can give guarantees. It is of course a very challenging problem in applications as well, and in a variety of cases people want very accurate solutions for tasks of this kind on sequential data. What Vitaly and I presented is a series of data-dependent learning guarantees for the very general case of time series prediction: non-stationary, non-mixing, no assumptions. The notions of discrepancy that you heard about multiple times in this talk are really crucial for tackling this problem, both from the theoretical and the algorithmic point of view. We discussed algorithms and guarantees, and showed that these algorithms can also be effective in practice. The combination of online learning and batch (stochastic) prediction leads to a whole other set of very interesting solutions; I tried to say a few words about how this could be done; in particular, it can be used to solve problems that are really difficult and that we do not really know how to tackle otherwise, such as model selection or learning ensembles. Again, these are things I could talk about a lot more. Let me finish by further advertising the workshop: it takes place on Friday. Thank you. [Applause]
Info
Channel: Steven Van Vaerenbergh
Views: 4,183
Keywords: conference, nips, nips2016
Id: zSlvCvqcUAg
Length: 105min 5sec (6305 seconds)
Published: Thu Jan 18 2018