Empirical properties of financial data (QRM Chapter 3)

Captions
Okay, welcome back, and good afternoon everyone. My name is Marius, I'm an assistant professor in statistics at the University of Waterloo in Canada, and I'm a bit of a mixture between a statistician, a probabilist and a computer scientist. This afternoon I'm going to be the computer scientist, mostly, 99%. I will give you a small introduction to R — where you can find sources and help and these things — and then I would like to use R as a vehicle to show you some of the things you learnt this morning. This is really why we want to use R: it gives us another possibility of digesting, of learning, the things we discussed in terms of formulas, in terms of mathematics and probability theory. So that's the plan. I witnessed an active discussion during lunch concerning the differences between Python and C++, so I know you're a very diverse group; the variance is quite large. I'll do my best to set the level right for everyone, but it's not going to be very easy, just to warn you.

So, my first question: how many of you have used R before? Okay, fantastic. How many of you have written functions in R using for loops? Okay. How many of you know what lapply() is? How many of you know what vapply() is? Aha. And how many of you have written packages yourselves? Okay, good, great. So I'll try my best to make it right for everyone.

The first question: why R at all? For us, it's free, it's freely available — that's fantastic, that's one part. On the computation side, it has a lot of statistical tools, which is what we need; I don't need to implement my own goodness-of-fit test in C++ or anything like that, all these tools are available. Then graphics: graphics is very important, because we mostly want to visualize things, and that's maybe the biggest message I would leave you with this afternoon — how much you can extract from a nicely made plot. You can see much more than from a table of numbers or single numbers; you always want to plot functions and learn from the behaviour. R is also higher level than other programming languages, which is great: if I want to optimize a function I can simply use the optimize() function, and for optimization under constraints there are functions available to do that, I don't have to do it on my own. And there are lots of packages available that implement the tools I might need at some point. There are close to 9,000 packages at the moment, far over a hundred submitted every week and close to a hundred published every week. So there is this huge database of packages we can make use of, and if you search for something specific it's quite likely that there is already a package for it; it's also very likely that there is more than one package that does the same thing. So you will have a lot of packages available — let's see where to find them.

The main R website is the one you see on the right side, the R project site; the one you most likely want to go to is the CRAN website (I say "cran", some say "C-RAN"). This is the website where you find a lot of information about R, and the sources as well, and so on. For example, not many people know about task views.
If I have a problem, let's say I want to do time series modeling, as Alex will do later today, I can go to the task views and see a lot of packages described there; I roughly get an overview of what's available and what I can use it for. For example, I see the rugarch package — at least it was here before, let me just search for it — yes, there is the rugarch package, and that's the one we're going to use. So you get a nice overview of what's available. There is also a task view on optimization, for example; if you have optimization under constraints, you get an overview there of the packages you could use. So task views are very helpful.

Then you get, of course, the R sources to install, or binaries as well; that's maybe not so interesting to go into in detail here. But then the packages: you get an overview of all the packages, and you can look into, let's say, the qrmtools package, or just search for "qrm" and see what's available. Once you click on it you see an overview of the package: the current version, when it was published, what other packages it relies on or uses, and the corresponding maintainer — so if there is a problem you can write to the maintainer — and so on. You also get a manual, and that manual is essentially a PDF with all the help files; you can jump to a function and see what it does. For example, we will use the risk measures later today, and you see some risk measures implemented here. The one we currently use is not in here yet, because the package has essentially just started and is under heavy development — I think the last update I did was on Saturday. So be aware: if something does not run because you're missing a function or anything, always get the very latest version of the packages, and ideally of R itself as well. So you see the functions that are there, and their arguments; I'll look at that a little more closely inside R in a moment.

You also get vignettes. Vignettes are simply HTML files that give you a nice overview of certain aspects of a package. For example, one of the vignettes here is about fitting and predicting Value-at-Risk based on GARCH processes, so a little bit of a mixture between what Rüdiger did in Chapter 2 and what Alex will do later. It's a very nice HTML file; you can extract the code, you see a lot of nice plots, and so on. Many packages have vignettes by now, and they give you a very nice overview of a certain aspect. You also get the package source: if you really want to know how something is implemented, one idea would be to download the package source, unzip it, change into that directory and look inside, and you get all the functions in there, the help files as well.

Then there are the manuals. I can particularly recommend "An Introduction to R"; it's a good manual if you haven't worked with R before. Then there's "Writing R Extensions" for the professionals, which also covers package development. What else is there? The FAQs are there — some really nice ones, especially point 2.7, which tells you how to get help about R, so there's an overview of the possibilities for getting help on R.
You have the manuals, as we just discussed, for getting help on specific functions inside certain packages; you have the task views, which we already discussed. You can simply use Google — "R help" and then what you are looking for typically leads you to some answers — and there are also specific R-related search engines if you are into that. There are the R mailing lists: R-help is the general mailing list you can subscribe to, to get news about R, interesting questions and so on. Then there is Stack Overflow, which you might have heard of: there you can post questions and other people answer them, and that typically goes very quickly. If you have a good question, it will be upvoted by others, so you also get points for asking good questions — but these should be non-trivial things; if it's answered somewhere else they'll go after you, so don't post trivialities, only good questions you couldn't find the solution to. Then there's the R Graph Gallery — I already opened it earlier — which just shows what kind of nice graphs you can make with R, all sorts of different graphs. Let me just give you an example: you can go inside, you see the graph, and you also see the R code. So if you see a plot and say "hey, that's actually something I want to have", you can see the R code as well and learn from there. Then R-bloggers, a blog about R in general, where you find very nice things people have done with R. So that's the more general side.

This is all external, of course; internally, from within R, you can also get help. Let me just do that here: I start an R process, and let's say I want to find the root of a function and I don't really know how this works, but I hope I can use the uniroot function. I type ?uniroot and I get the help file of the uniroot function. This is essentially what you've seen before in the manual as a PDF: you see the function, you see its arguments explained — it's always the same structure — you see details about what's going on underneath, or what to take into consideration when using the function, you see the return value explained (what is it? in this case a list of different things, and what does it contain: the root, the function value at the root, and so on), possibly references to the underlying algorithm, sometimes also who implemented it, and then examples. Those examples are typically helpful: you can just execute them and learn from them. So here I executed the example: this is my function, I compute the root of that function, and this is the corresponding output. It's extremely helpful to learn from these examples at the bottom of the help files; that's typically the way to go.
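As a concrete version of the ?uniroot example just mentioned, here is a minimal sketch; the target function and search interval are made up for illustration, not taken from the help page:

```r
## Open the help file for uniroot(); executable examples are at the bottom
?uniroot

## A small root-finding example (function and interval chosen for illustration)
f <- function(x) x^2 - 4              # has a root at x = 2 on the positive axis
res <- uniroot(f, interval = c(0, 3)) # returns a list, as described in the help file
res$root                              # approximately 2
res$f.root                            # function value at the root (close to 0)
```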
Now, how can I work with R? Most of you probably work in RStudio, and that's also something I would recommend; Alex uses RStudio as well later on, so you'll see the GUI, the browser-like interface, of RStudio. I work in a slightly more specialized way, with Emacs and ESS, which is helpful for package development; that's the setup you see at the moment. The idea overall is that you write a script — a file with ending .R — and in there you have the commands, which you then execute line by line or paragraph by paragraph. You can also send the script to a different computer, if you like, to execute it in parallel, for example if you have a bigger simulation study involving certain risk measure computations, time series modeling and things like that. But if you start with just a script ending in .R and have the commands in there, you can typically, in all sorts of GUIs, step through it line-wise and see the output; that's typically the way to work with R.

Now, how to install things: R itself you can look up on CRAN, of course, and installing certain packages works with the install.packages() command from inside R. So if you need qrmtools from CRAN, you call install.packages("qrmtools"), and that's the package we will use today. If you need the very latest version — because, as you saw, the version on CRAN has not been updated since the end of May, essentially — the latest version you always find on the development server, which is R-Forge. So if you want to install the very latest version (by the way, this is also mentioned on the website, you don't have to take notes here), you call install.packages("qrmtools") but additionally specify the repository via the repos argument, set to the R-Forge server address, and then you get the very latest version.

The scripts we present to you are all available as well. How do you get them? You go to the qrmtutorial website Paul already mentioned. There you find a lot of things: for example, the full set of slides Paul mentioned earlier today, and, organized by chapter, the corresponding R scripts. At the moment it's a bit tedious to download all of them one by one; if you want the complete repository, just click on "complete GitHub repository" and then on the button that says "Clone or download", and you get the whole repository, which even includes the full set of slides. So this is how you can work with it. The script I actually have open is the first one, an introduction to R programming. This script takes about 80 minutes and gives you an introduction to R; obviously I'm not going to do all of that here, so execute it line by line and learn from it. There are lots of comments in there explaining things like division by zero and certain problems of that type; go through it on your own, it's a very good exercise.

There is just one thing further down I would like to point out very quickly, namely random number generation. We very often generate random numbers from certain distributions, and I would like to give you a little bit of an overview here. I've just started a new R process — you see that process down here — and I generated two random numbers from a normal distribution. R, without us doing anything, generates a seed. There is no such thing as randomness on a computer; there are only sequences that mimic randomness, and pseudo-random number generators do exactly that: they walk along such a sequence and give you the "random" numbers. That's essentially what's happening.
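A minimal sketch of this seeding behaviour, and of the reproducibility point that comes up in a moment; the seed value 271 is an arbitrary choice:

```r
rnorm(2)                 # two N(0,1) random numbers; R seeds itself on first use
RNGkind()                # reports the generators in use, e.g. "Mersenne-Twister" and "Inversion"

set.seed(271)            # fix the seed (arbitrary value) for reproducibility
x <- rnorm(2)
set.seed(271)            # go back to the same point in the pseudo-random sequence
y <- rnorm(2)
all.equal(x, y)          # TRUE; see ?all.equal for why this is preferred over ==
```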
The reason why I mention this is that, for generating from the normal distribution, the method R actually uses is the so-called inversion method — you see that here, and you will see it again in Chapter 7 tomorrow afternoon. The inversion method works as follows: it essentially uses Value-at-Risk, or the quantile function, to generate random numbers. All you need to know is how to generate uniforms from a uniform distribution on (0,1). This is done in a rather complicated way by the so-called Mersenne Twister, or by the other pseudo-random number generators available, and it is virtually the only thing the computer really knows how to do well. Then, to get, let's say, a Pareto distribution or a normal distribution, you apply the corresponding quantile function to that uniform — essentially our Value-at-Risk at this random confidence level, if you like — and this turns out to be a realization of a random variable from that distribution F. The proof is a very simple one-liner: you just show that this new random variable is indeed distributed according to F, and all you do is bring the quantile function to the other side. This is an absolutely crucial step in understanding also the multivariate case, which you will learn about in Chapter 7 tomorrow. The reason I show this to you is that you all know the normal distribution function is not explicit, and the normal quantile function is not explicit either, but R still uses exactly this inversion method to generate from it. That is something on the technical level I think is important to know: we can approximate the normal quantile function to such a precision, and in such a fast way, that this method is feasible. So that is something interesting to know.

Of course, if I generate two more random numbers, I get something different than before, obviously. But I'm interested in reproducibility; I would like to be able to reproduce a result. That's why I need to set the seed: I need to specify where I am in that sequence of numbers that look like random numbers but are actually not, and once I specify that seed I know which random numbers are produced next. So if I first set the seed, draw two random numbers from the normal, then set it again to the same seed — essentially go back along that sequence — and draw two normals again, then you see: they are the same. Now, why did I use all.equal() here and not the equality sign? That's a small exercise for you: look up the help file of all.equal(), a very good function to know. So that's some homework for you, so to speak.

That's all I wanted to do concerning the introduction. Now I want to do a little bit in R concerning Value-at-Risk and expected shortfall estimation. I learned this morning from Rüdiger that Value-at-Risk is essentially the quantile function, and expected shortfall is essentially the mean over all losses that exceed Value-at-Risk. So if I have a bunch of losses, my empirical distribution function looks like this: I jump by 1/n at each of my n losses, so I have a step function. Nonparametrically speaking, if I don't want to assume a certain model, this is the best I can do.
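As a tiny illustration of this step-function idea (the losses here are made-up numbers, not the data used in the session):

```r
L.demo <- c(2.1, 0.4, 3.7, 1.2, 0.9)     # a few made-up losses
Fn <- ecdf(L.demo)                       # empirical distribution function: jumps of 1/n at each loss
plot(Fn)                                 # the step function described above
quantile(L.demo, probs = 0.8, type = 1)  # the corresponding empirical quantile (type = 1: inverse of Fn)
```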
I can replace my distribution function by the corresponding empirical distribution function, and the quantile function by the empirical quantile function. The way this works is simple: alpha is a certain confidence level, I go over exactly in the way Rüdiger showed us this morning, and that is going to be my Value-at-Risk based on this empirical distribution function — my empirical quantile at level alpha. This I can use as an estimator, and then I can simply plug that estimator in here and average over all losses exceeding it to get an estimator of expected shortfall; my expected shortfall estimator at level alpha would be that. But I want to compare the two and investigate, for example, what the variance of each estimator is, how I compute that, and whether it is really true that my Value-at-Risk estimator is smaller than my expected shortfall estimator. That's what we will look at now. There are lots of comments here about how to derive these estimators and so on; you don't need to know the details at this point, but if you like, go back and read the script.

I take 2,500 losses — about 10 years of daily data, which is of course a lot, but still, this is what I do at the moment — and then I would like to repeat this estimation procedure a thousand times to see how these estimators perform. As the true underlying loss distribution I assume a Pareto distribution with parameter 2; the Pareto distribution has this distribution function, which I put down here, and the quantile function is very explicit too, so I can use it to simulate losses from the Pareto distribution with the inversion method we just discussed. Those are going to be my losses, and then I do nonparametric estimation of Value-at-Risk and expected shortfall based on them. I then resample these losses a thousand times, and each time I do the Value-at-Risk and expected shortfall estimation, so instead of just one estimator I get a thousand estimators, and from these thousand estimators of Value-at-Risk and expected shortfall I can compute, or approximate, a variance, confidence intervals and so on. The technical term behind this is the nonparametric bootstrap, and the full version of the slides has a bit more on that in the appendix if you are interested.

[Answering a question:] Yes, correct, that's technically the same, obviously. Instead of generating a thousand times 2,500 resampled losses, if you just generated a thousand times 2,500 new losses, that would indeed be different, because this is the nonparametric bootstrap: I resample the same data, I do not generate more data, so this is a realistic setup.

There is a lot going on here, so stay with me. Let me set the seed, because I want this to be reproducible — if you go home, do this and get a different result, you'll blame me, so that's why I set the seed. Then I generate my base losses from the Pareto exactly with that inversion method: I apply the Pareto quantile function qPar() to n = 2,500 uniforms, with Pareto parameter 2. The confidence level I look at here is 99 percent, and I compute the nonparametric Value-at-Risk and expected shortfall estimators; you get 8.16 and 13.42.
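A minimal, self-contained sketch of this simulation and of the nonparametric estimators — not the qrmtools implementation itself. The Pareto parametrization F(x) = 1 - (1 + x)^(-theta), x >= 0, the choice type = 1 for the empirical quantile, and the strict inequality in the expected shortfall average are assumptions following the description above; the exact numbers depend on the seed:

```r
set.seed(271)                              # arbitrary seed, for reproducibility

## Pareto quantile function, assuming F(x) = 1 - (1 + x)^(-theta), x >= 0
qPareto <- function(u, theta) (1 - u)^(-1/theta) - 1

n     <- 2500                              # roughly 10 years of daily losses
theta <- 2
L     <- qPareto(runif(n), theta = theta)  # losses via the inversion method

alpha <- 0.99
## Nonparametric Value-at-Risk: the empirical alpha-quantile
VaR.np <- quantile(L, probs = alpha, type = 1, names = FALSE)
## Nonparametric expected shortfall: mean of losses strictly exceeding VaR.np
ES.np  <- mean(L[L > VaR.np])
c(VaR.np, ES.np)                           # two numbers comparable to the 8.16 and 13.42 quoted above
```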
Those are just two numbers I can believe or not; I don't really learn anything from single numbers. What I would learn a little from is to look inside the estimator function and see how it's defined: you see my Value-at-Risk is indeed just quantile(), the empirical quantile function at my confidence level alpha, so there's not much going on, and my expected shortfall is simply, essentially, picking out with an indicator the losses that exceed my Value-at-Risk estimate, which is computed here, and taking the mean over them. I do this in a vectorized way, so you can feed it several confidence levels at the same time, so it is slightly more involved than that, but that's essentially what's happening.

Now, what could we look at beyond such single numbers? Functions: we want to look at the two estimators as functions of the confidence level, to see their behaviour, and we are especially interested in large alpha. So what I do is concentrate the alphas near one, with a sequence of the form one minus one over ten to some power, just to make sure they are all between zero and one, and then I evaluate Value-at-Risk and expected shortfall. Aha, there were some warnings here: in some cases only five losses were bigger than Value-at-Risk, then four, three, one, and so on, down to zero losses. This is clear if you think about it: we only have a finite number of losses, and once my confidence level is too large, my Value-at-Risk estimate is going to be the largest loss — and then what do I have left in the tail? Nothing. So how many losses strictly exceed Value-at-Risk? None. Here you already see the difference between theory and practice: in practice it does make a difference whether I use equality or strict inequality, because with equality, for sufficiently large alpha, I would always average over the largest loss, and at least that would give me a number. I chose strict inequality here, so there are no losses exceeding, and the estimator simply cuts off. So it does make a difference, and that's a bit of the challenge when working with software: a distinction that is negligible in theory matters on the computer.

So I have these estimators, and I also have the true values for the true underlying Pareto distribution — a very good exercise, do that: compute Value-at-Risk and expected shortfall for the Pareto distribution. [A question:] Yes, that's a different story, of course; you can do a lot there, but I just fixed it large enough for the moment. How large to choose the number of replications and so on is not the point we are interested in right now.

Then I plot all the computed values so far — the true Value-at-Risk and expected shortfall and the corresponding estimators — and if you are very honest you blame me now and say "I don't see anything". That's true; please don't try to see anything in this picture, it's not possible. For the large alphas we are interested in, everything suddenly increases very steeply, and you don't see the interesting region. So what could I do? I could scale these values down to see more of what's going on: I could use a logarithmic y-axis, and then I already see a bit more. For example, I see that both of my estimators are increasing, as they should be.
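A sketch of the true quantities and of the alpha grid described here, continuing with L and theta from the previous sketch, and under the same assumed Pareto parametrization F(x) = 1 - (1 + x)^(-theta); the exact grid and plotting details in the course script may differ:

```r
## True Value-at-Risk and expected shortfall under the assumed Pareto parametrization
VaR.Par <- function(alpha, theta) (1 - alpha)^(-1/theta) - 1
ES.Par  <- function(alpha, theta) theta / (theta - 1) * (1 - alpha)^(-1/theta) - 1

## Confidence levels concentrated near 1, of the form 1 - 1/10^x
alpha.seq <- 1 - 1/10^seq(0.5, 5, by = 0.05)

## Nonparametric estimates on the same grid (NaN once no loss strictly exceeds the VaR estimate)
VaR.np.seq <- quantile(L, probs = alpha.seq, type = 1, names = FALSE)
ES.np.seq  <- sapply(VaR.np.seq, function(v) mean(L[L > v]))

## Compare true values and estimates in log-log scale, plotted against 1 - alpha
plot(1 - alpha.seq, VaR.Par(alpha.seq, theta), type = "l", log = "xy",
     xlab = "1 - alpha", ylab = "Value-at-Risk / expected shortfall")
lines(1 - alpha.seq, ES.Par(alpha.seq, theta), lty = 2)
points(1 - alpha.seq, VaR.np.seq, col = "blue")
points(1 - alpha.seq, ES.np.seq,  col = "red")
```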
I see that my Value-at-Risk estimator is picking up the true underlying Value-at-Risk quite nicely; the expected shortfall estimator has more difficulties. Why? Because it needs to consider the whole tail of the distribution and not just the quantile, so it has a slightly harder job — you can also already expect that the variance of this estimator will be larger. But I'm still not happy here, because I still don't see the full picture for the really large alpha values. So what I can do is also use a log scale on the x-axis and plot against 1 minus alpha. That is, by the way, exactly the plot Rüdiger showed you this morning with the normal and the t model in log scale, with Value-at-Risk being larger or smaller for the normal versus the t; that script is online as well, so you can reproduce it. So I plot against 1 minus alpha, in log-log scale, and suddenly I see a bit more. I see that my Value-at-Risk estimator is suddenly constant — but what is that constant? Question for you, can anybody help me: why is my Value-at-Risk estimator suddenly constant here, and what is this value? Yes, very good, it's my largest value, exactly: for sufficiently large alpha I will always return the largest loss, so I can actually read off my largest loss from this plot, and it is flat because I have a step function. My expected shortfall estimator would actually be the same if I used equality — then I would simply average over the largest loss, which gives the largest loss — but it drops off here because I use the strict inequality. I also see that the expected shortfall estimator departs from the true underlying value much earlier than the Value-at-Risk estimator, so it is more difficult to estimate expected shortfall than Value-at-Risk, at least nonparametrically. All of what we do here very much motivates the parametric or semi-parametric estimators you will learn about tomorrow in EVT — the nonparametric estimators are by far not the best ones — but I think it is a very nice example of what you can do with a nicely made plot in R, to learn about the theory and the practical applications of it. You can take away quite a bit from this interaction, and you can play around in R; that's the nice part. And of course the true underlying quantities don't cut off: for a continuous distribution they just continue into the tail. But you see this is definitely a region of alphas we are interested in, and you see how far off you can be, essentially.

Now, I have this one sample of size 2,500, and I computed these Value-at-Risk and expected shortfall estimates, as functions of alpha, based on that very same sample. In order to get a notion of a variance, or a confidence interval, around them, I can resample the data. So I just assume that's my data; I assume I don't know anything, I don't assume I know the underlying Pareto — I kick that out of the game. Then the best I can do is believe that the empirical distribution function based on the losses is the true underlying distribution function, and if I believe that, I have to sample from that empirical distribution function, which means nothing but resampling from the data. So I resample a thousand times from those losses, with replacement.
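A minimal sketch of this nonparametric bootstrap, continuing with L and alpha from the sketches above; the use of replicate() and the value of B are illustrative choices, not necessarily what the course script does:

```r
B <- 1000                                                # number of bootstrap resamples
boot <- replicate(B, {
  L.star   <- sample(L, size = length(L), replace = TRUE)  # resample the data with replacement
  VaR.star <- quantile(L.star, probs = alpha, type = 1, names = FALSE)
  c(VaR = VaR.star, ES = mean(L.star[L.star > VaR.star]))
})
apply(boot, 1, quantile, probs = c(0.025, 0.975))        # bootstrap 95% confidence intervals
apply(boot, 1, var)                                      # approximate variances of the two estimators
```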
Each time, I do exactly that: I compute the Value-at-Risk and expected shortfall estimators exactly as we did before. Then, for each alpha, instead of just one estimate I have a thousand, and I can cut off the lower 2.5% and the upper 2.5% to get a 95% confidence interval; I can also compute, for each alpha, the variance over these thousand values as an approximate variance estimator of the nonparametric Value-at-Risk and expected shortfall estimators, and I can still plot everything in a single plot. So let's do that. I need to bootstrap everything here, and of course, for large alpha, I slowly run out of samples above Value-at-Risk; here you see the percentage of NaN ("not a number") values you get from the expected shortfall estimator as I increase alpha. Around 99.9%, roughly, you see that there are fewer and fewer observations I actually average over. That is of course critical — it doesn't make sense if I don't have any losses left to average over — but that's just for me to see what's going on.

In the end, what do I get? I get a large plot in log-log scale which contains a lot of information: essentially the true values of Value-at-Risk for the Pareto distribution and of expected shortfall, the larger one giving a line above Value-at-Risk; then my estimators, the dashed lines, the ones you've seen before, with the red one for expected shortfall cutting off here; and then confidence intervals around them. They are not too reliable, because I have fewer and fewer observations as I go along, but still you see, for example, that the confidence intervals around expected shortfall are larger than the ones around Value-at-Risk; that is roughly visible here. You also see what the variance is doing: first of all, the variances increase as alpha increases, so the larger my confidence level, the larger the variance of my estimators, which is of course a bad thing. I also see that expected shortfall has a larger variance overall than Value-at-Risk, so I have to expect higher variances if I estimate expected shortfall rather than Value-at-Risk — again, I look into the whole tail, not just a single quantile. Those are the types of observations you can take away from here. So this is a very nice exercise overall, I think. As I said before, you have all the results I just mentioned down here again as comments; all of these scripts are available online and built roughly similarly, so you see a lot of comments you can read through. And that's it from my part; Alex will show you some more scripts and will continue with Chapter 3.

Okay, so we're going to move on to Chapter 3, which is called "Empirical properties of financial data", and it would seem natural to immediately look at financial data if we're going to talk about empirical properties. Marius has given you the fundamentals of what R is and where you find it — I was pleased to see that a lot of you already know it; some of you don't. For those of you who haven't seen it before and are wondering what system to use: I'm a little bit less hardcore than Marius, so I'm going to use RStudio, and you'll see RStudio, the big white window in the background. If you've cloned the repository as Marius suggested, that's going to be the easiest thing to do.
If you want this code, go to qrmtutorial and take the zip file, which is the whole repository, and in the R directory you're going to find this subdirectory structure. We're going to go into Chapter 3, and there is a script in there called "exploring financial time series data" — you see it here, empirical properties, exploring financial time series data — so that's where I'm going to begin. Let's see how that is for size: how well can you read that in the back row? Would you like a slightly bigger font, or can you read the first thing that says, in blue, library(xts)? It's okay? Don't hesitate to tell me to blow it up a little, but the smaller we keep it, the more you can see in one window, so the more you see of what's going on.

So this is RStudio. There are three panels here which are going to be active: my pre-prepared script is up here, commands that are executed are echoed down here, and over here various things will appear. One of the first things we might look at in RStudio is the list of packages: there is a list of packages here which I have available, and if the box is checked, that package is already loaded. I'm going to need two further packages for my presentation of this chapter, so let's begin with library(xts). This is a time series library — we're going to spend the rest of the afternoon essentially looking at time series — and xts stands for "extensible time series". It's not my work, it's not Marius's work, but we only recommend what we think are good-quality packages. If packages are being added to the repository at the rate of a hundred a day, it's fair to ask whether they are all uniformly wonderful; there is a certain amount of quality control, but we are relatively sparing with the packages we use, and xts I think is the top time series package. If you look at the task view Marius showed you, you will see that there are many time series packages, but we've gone for xts.

The next library is one of ours, qrmdata, so let's have a look at what's in qrmdata: "datasets for quantitative risk management practice". This is my effort and Marius's effort — actually more Marius's effort than mine. I said it would be nice to have some data; I said I would settle for a few exchange rate time series, a few stock index time series and a few stock price time series; I said maybe we should have some Dow Jones stocks, and Marius said "let's have them all"; I said maybe we should have a few Standard & Poor's stocks, and Marius said "let's get them all". This version of the library goes up to the end of 2015, so it's not updated in real time, but it's historical data to play with up to the end of 2015, and the intention is that we will update the package yearly. We had to ask special permission to have this package, because it's considerably bigger than the standard default package size. In here you will see foreign exchange rate data, stock index data, some commodities (the gold price, the oil price), some volatility index data — just lots of things to play with — and some interest rate data, which will be one of the subjects on Wednesday morning. In this script we're going to look at some of it: for example, this one here, DJ_const, the Dow Jones constituents, so we'll load that. And immediately, a command that many people like to use to get an overview of what an object is — everything is an object, and DJ_const is an object.
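A minimal sketch of these first steps; DJ_const is the object named in the session, and the use of str() as the "overview" command is an assumption on my part:

```r
library(xts)       # extensible time series classes and tools
library(qrmdata)   # datasets for quantitative risk management practice

data("DJ_const")   # makes the Dow Jones constituents available as an xts object
str(DJ_const)      # overview of the object: class, dimensions, date range, ...
```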
It is an xts object, an extensible time series; you can see the date range, 1962 to New Year's Eve 2015, and other more cryptic information that true R enthusiasts will be able to interpret to some extent, but I won't go into it. Some of the basic things you need to do: you always want to pick out time periods. For example, we will pick out the time period that goes from the end of 2006 to the end of 2015 — from the 29th of December, which gives me basically everything in 2007 up to the end of 2015. It's a very natural syntax: you just put the date range in quotes. And we'll take the first 10 stocks — I'll pick up the speed a little bit here — and plot them with plot.zoo. It seems like a strange name for a command; zoo is another kind of R object intended for data with a natural ordering in time, and plot.zoo shows us the first ten Dow Jones constituents.

When we talk about the empirical properties of data we will look at return data, and by default we will generally — I think almost always — take log-returns. There's no point in having a function to compute log-returns: log-returns are log-differences, so take the log of the values and difference them. There is just one little twist: if you go from a time series of length n and difference it, it becomes length n − 1, but when you apply these operations to a time series it returns something of the original length with the first observation being NA, so we just remove it. So don't worry about this "minus one" every time we take a log-differenced series; it will keep appearing. DJ.X is the series of log-returns of the 10 stocks; head() gives you the first six values, and here are the first six of the last four stocks — there's Walt Disney and Goldman Sachs. These are log-returns, and again plot.zoo: these are the kind of time series we want to discuss in this Chapter 3. They show something that we will informally call volatility — I think that's quite clear. Anyone remember the year 2008? There was a lot of volatility in 2007, 2008 and into 2009; we remember that as the financial crisis — although, interestingly, I now teach MSc students who are young enough that they don't really remember it very well, whereas it's still very fresh in our memory, I think. So we see volatility; we will have to describe it mathematically in some way, but it's the sort of thing you know when you see it: volatile periods, and then quieter periods later on.

Another thing we're going to want to do a lot is change the timeframe. In qrmdata we largely have daily data — in fact I think most of the time series are daily, not higher frequency (we really would have problems putting a big high-frequency database on the CRAN archive), and not lower frequency, because of course we can construct lower frequencies from the daily frequency. It's very easy to get weekly, monthly, quarterly or yearly returns: in the xts package there are commands such as apply.weekly. One of the nice things about log-returns is that if you add them up within a week you get weekly log-returns, and if you add them up within a month you get monthly log-returns. So in order to get my weekly returns I apply weekly summation: I sum all the columns using the function colSums. Obviously I reduce the quantity of data I have by doing that. I don't think we looked at the size of the daily data, so I'll just type that in at the bottom.
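A sketch of the subsetting, log-return and aggregation steps just described, continuing with DJ_const from the previous sketch; the date range is the one mentioned in the session:

```r
## Pick a time period (end of 2006 to end of 2015) and the first 10 stocks
DJ <- DJ_const['2006-12-29/2015-12-31', 1:10]
plot.zoo(DJ)                                    # plot the price series

## Daily log-returns: log-differences, dropping the leading NA row
DJ.X <- diff(log(DJ))[-1, ]
head(DJ.X)
plot.zoo(DJ.X)                                  # volatility clustering is visible here

## Aggregate to lower frequencies by summing log-returns within each period
DJ.X.w <- apply.weekly(DJ.X, FUN = colSums)     # weekly log-returns
DJ.X.m <- apply.monthly(DJ.X, FUN = colSums)    # monthly log-returns
DJ.X.q <- apply.quarterly(DJ.X, FUN = colSums)  # quarterly log-returns
dim(DJ.X); dim(DJ.X.w); dim(DJ.X.m); dim(DJ.X.q)
```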
We started with 2,266 days; then of course you sum them up within weeks and we have 470 weeks — those are the weekly returns, and you still see the volatility in weekly returns. I can compute monthly returns — we have 108 months, those are the monthly returns, and you still see volatility in the monthly returns — and so on: apply.quarterly gives 36 quarterly returns. So what do we have? We have nine years of data, and those are the quarterly returns.

That's just some stock data. Just to show you a few more: stock indexes. What shall we take? We'll take the S&P 500, the world's biggest financial market, the United States; we'll take the FTSE — I work in the United Kingdom, in Edinburgh; and of course we'll take the SMI. You can look and see what you've got here: the S&P 500 goes back to 1950, and we have the constituents of the S&P 500 going back to 2000 — okay, not quite so far, but we've still got lots of data in there. That's the S&P 500 since 1950. Sometimes you want to join time series together: you want to take, for example, the S&P 500, the FTSE and the SMI and merge them. You might start by taking everything you've got, all = TRUE, and having a look at that; that's everything we've got, and we've got less of the Swiss, less of the UK and more of the American. So we might then just merge the dates that we have in common across all the time series — that's all = FALSE, retain only days where the indexes all have values — and that's our index data, for example. This is just reinforcing the basic things we always tend to do: log-differences. I log-difference them all; let's plot the log-differences — plenty of volatility there, and there's 2008 again. Another thing you'll see us do a lot: pairs plots, because tomorrow we're going to get into the multivariate structure of data. Today I will largely deal with single time series, but tomorrow we will go into dimension d, and quite often we learn things by looking at pairwise scatter plots of each possible pair. Again we could aggregate those things; I won't bother in this case.

Let's just look at a few other odds and ends. Exchange rates: I've taken the base currency to be the US dollar, and I've taken the pound, the euro, the yen and the Swiss franc, and I'm going to have a look at those — all of this is in qrmdata. There are the first few values behind me; here's what those exchange rates look like. Log-returns again: typically we would look at log-returns, I think, to express the way exchange rates change, so those are the log-returns. Remember this event here, with the Swiss franc — when was that? Yes — an extreme value there, caused by the un-pegging, removing the floor. So, exchange rates. Another thing we have in here: a couple of bond or yield curve databases, zero-coupon bond yields. Let's have a look and see what the quality of the documentation is like for this one — yes, interest rate data: we have zero-coupon bond yield curves in Canadian dollars and in US dollars, again stored as xts time series. These are quite nice for exploring the way that yields change for different times to maturity, so let's attach that data.
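A sketch of the yield-curve exploration that follows; the dataset name ZCB_CAD, the rescaling to percentages and the column positions of the chosen maturities are assumptions on my part (check the qrmdata documentation for the exact names and units):

```r
data("ZCB_CAD")                     # assumed name for the Canadian zero-coupon bond yields
dim(ZCB_CAD)                        # days x times to maturity
Z <- 100 * ZCB_CAD                  # express the yields as percentages (assumes decimals are stored)
plot.zoo(Z[, 1:10])                 # first ten maturities, going up quarterly

## Daily changes in yields: simple differences rather than log-differences
X.zcb <- diff(Z)[-1, ]
sel <- c(1, 8, 40)                  # e.g. 0.25y, 2y and 10y maturities (positions assumed)
plot.zoo(X.zcb[, sel])
pairs(as.matrix(X.zcb[, sel]), gap = 0)  # strong positive dependence between maturities
```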
You'll have noticed by now, I think, that to use these data sets you first use the command data(); that makes the data available. The dimension here is 6,088 days and 120 different times to maturity. For example, if we look at the first six lines and I spread this out a little bit — I need to go back to the start of it — what we have here, starting in January 1991, are yields for times to maturity of a quarter of a year, half a year, three quarters, one year, and so on; it goes up in quarter-years to thirty years in this data set. I will change the yields to percentages and plot a few, so let's plot the first ten times to maturity, going up quarterly to two and a half years. These are the actual yields — okay, this is a slightly older picture, this one ends in 2012; in general most of the datasets do go up to the end of 2015 — beg your pardon, we do have them up to then, it's just that I selected that period. I will take simple differences here rather than log-differences: with negative rates there are problems with taking log-differences, so I'll take simple differences, although all of the data in here are actually positive. I'll pick three maturities and we'll plot them: daily changes in the yields for times to maturity of a quarter year, two years and ten years. These are likely to be quite dependent — at least, for example, the changes in the two-year yields and the ten-year yields are quite dependent. How do I know that? When I look at the scatter plots, I see a lot of positive correlation. I have 120 different times to maturity; if I'm going to model bond portfolios I'm probably going to want to reduce the dimension in some way, so we will talk about factor modeling, beginning tomorrow, for reducing the dimension of these kinds of risk-factor data.

That's probably enough of just looking at data, and I now want to take you back to the slides and discuss some of the things you've seen in these data in a more systematic way. We call these the stylized facts of financial return series. I don't think we were the first to call them stylized facts; it's a term that's been around. What is a stylized fact? I find it a very strange term, but it's basically an empirical observation. I'm not saying it's an absolute truth, but it's something we observe again and again and again, without exception — so, empirical observations, and logical consequences of these observations, which apply to many financial time series of risk-factor changes. You heard in Chapter 2 this morning that we deal with risk factors: the risk factors are prices, rates — later on we'll have things like volatilities — so interest rates, exchange rates, prices, index levels; and we are interested in their changes, which we called X. For many of them we will take log-differences, occasionally ordinary differences, so these stylized facts apply to those kinds of time series. The timeframe I have in mind is daily — I'm typically thinking of daily, but most of what I say will apply equally to weekly or monthly, and it may even stretch to quarterly. I'm not going to talk about stylized facts of high-frequency data.
Tick-by-tick data would have their own collection of stylized facts, things that we observe there. As for very low-frequency data, such as annual changes in risk factors, well, frankly we don't have a lot of data to say things about annual changes; you can do things like make overlapping annual returns, but that doesn't contain much more information than simply taking non-overlapping annual returns. That maybe introduces a question which I think is interesting, to me and to my co-authors. At this point it might be quite useful to just work out what sort of applications interest you in the room — largely you're actuaries. Who here deals with work that is created by Solvency II in insurance? What would you say, maybe not as much as a third. Who here deals with work that is created by the Basel regulations in banking? Okay, even less. I need to think of some other lines of work — actuaries, but I don't know — and a lot of you are PhD students. Well, depending on whether you're interested in Solvency or Basel, there are of course different time frames that are natural. In Solvency II you're supposed to be looking at annual loss distributions, computing a solvency capital requirement based on annual losses, but I think when you model the way that risk factors behave, you still want to build up your knowledge by modeling at a shorter timescale, perhaps monthly, maybe weekly or something like that. In banking, of course, short time intervals are of interest — there were hardly any hands up when I asked about interest in Basel work, but certainly for the trading books of banks the canonical time horizons are either 1 day or 10 days, so they are interested in the kind of changes you observe at those frequencies. But I think even for Solvency II work, when you model these kinds of risk factors, I believe it's better to be modeling at a higher frequency than trying to simply model annual changes, where you don't have so much data. Or you can simulate: we can fit some time series models and simulate their consequences over one year, build scenario generators. It's perhaps not easy to get a picture of annual changes in fundamental risk factors directly, but simple scaling rules, and also simulations of the kinds of time series models we'll get on to, might be worth thinking about.

So, notation: if it's a risk-factor change it will be X; if it's a risk factor itself it will generally be Z (although later on we'll have some innovations which are also called Z), but generally the X's are risk-factor changes. For a price, the typical risk factor would be a log-price, and so the risk-factor change would be a log-difference. When I did diff(log(...)) in R, those were logarithmic differences — that was computing the logarithmic difference of prices — and the logarithmic difference of prices, at least over small time intervals, is not so different from the relative difference in prices; for small time intervals they are quite close. But generally, by default, we will take log-returns.

So what are the stylized facts? Well, one of the things we saw very evidently in a lot of those time series of log-returns was volatility clustering. This is an old example that we've had in the textbook since the first edition, the DAX index in Germany from 1985 to 1994. We had it in there because you can tell a historical story around most of those periods of volatility.
For example, this volatility here in late 1987, this one in very late 1989, and this extreme volatility here in 1991 were related to political and market events: that's the fall of the Berlin Wall, that's the putsch in Moscow in the time of Gorbachev, and that was the Black Monday events on the stock market. So you see volatility in some of these periods. What does volatility mean? Periods of large changes in value, but often with rapidly alternating signs: it goes up, it goes down, it goes up, but the magnitude of the change is large, and you tend to get large changes followed by other large changes, although not necessarily of the same sign. If you simulate independent data, you're never going to get that. There are two simulated time series here; they are of the same length as the real time series, and moreover they are from models that have been statistically fitted to the real data. So fit a normal distribution to the data and then simulate from the fitted normal, and of course you get something that looks nothing like the original — the scales are the same, and it simply can't fill out this white space in the way the real time series does. So fit a Student t distribution: if you fit a Student t you have three parameters to estimate — a location, a scaling and a degree of freedom that controls the weight of the tail — and you get something that begins to look like it; at least it kind of explores the same range of values. But neither of these two contains volatility: they are independent data, and you don't get the clusterings of large changes and, if you like, the clusterings of small changes. So I think I've said everything here: the simulated normal data, too few extremes; the simulated t data, a better range of values thanks to the degree of freedom, but no volatility clustering. So whatever time series models we use to describe time series like the first one, they should somehow attempt to reproduce this volatility clustering.

What else? Now I'm going to talk about autocorrelations a bit before I actually define them — they come in Chapter 4. Who's familiar with ACF plots? Again, maybe 50%. So what's an ACF plot? It's a picture of the estimated serial correlation structure, basically. What you have on the x-axis is called the lag: a lag of one would mean the estimated correlation between returns on day t and day t + 1, returns that are separated by one day in time, and obviously at lag three we would be looking at the correlation between returns on day t and day t + 3. We estimate the magnitude and draw a vertical bar, and basically you can't see anything: there are little notches here, but these vertical bars are very small — in fact they lie between two dotted lines. What you actually see here are estimated serial correlations which are very, very small, in fact not significant at all, and that is common for all three time series. It's the same three time series — I think I'm having a problem with the pointer, it's probably the battery; well, there's old-fashioned technology — so these are the three time series in question, and these pictures on the left relate to them: DAX, normal, t. Basically you see zero evidence of correlation within these time series, the real data and the independent data.
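A sketch of the kind of ACF diagnostics being described here and in what follows; the return series X is assumed to be one of the daily log-return series constructed earlier (a single column of DJ.X from the sketch above), and the lag choice in the Ljung-Box test mentioned below is illustrative:

```r
X <- as.numeric(DJ.X[, 1])   # one daily log-return series, as a plain numeric vector

acf(X)                       # ACF of the raw returns: typically nothing outside the bands
acf(abs(X))                  # ACF of the absolute returns: significant correlation at many lags

## A formal check of serial dependence (Ljung-Box test)
Box.test(X, lag = 10, type = "Ljung-Box")
Box.test(abs(X), lag = 10, type = "Ljung-Box")
```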
For the independent data there can't be any correlation: they are independent, and independence means no correlation. But then you see something very interesting: you take the absolute values — you remove the signs and take the time series of absolute values, the magnitudes — and then you see a lot of correlation, persisting to actually quite high lags. This is telling you that there is estimated correlation between the absolute values of these time series data, and that's consistent with the idea of volatility clustering, where large values are followed by other large values, although not necessarily of the same sign. But again, for the independent data you see nothing, because it doesn't matter what you do to independent data — you can square them, you can take absolute values — you're still not going to create serial dependence; they are still independent. So it's the difference between this picture and this picture which is the signature of a real financial time series, whereas this is what independent data do. I think I've said everything here: ACF stands for autocorrelation function — in fact these are estimates of the autocorrelation function. A non-zero ACF at lag 1 would imply a tendency for a return to be followed by a return of equal sign, which is not the case here; but this shows a tendency for a return of larger magnitude to be followed by another return of larger magnitude. There are other tests we can do: other tests of serial dependence based on the estimated autocorrelation function, often going by the name of Ljung-Box tests, or related tests, which you can use to test the dependence structure.

Let's move on to the next stylized fact. Oh, and I'm not going to say everything on every slide, but do stop me if there's something which looks interesting. For an iid process you would expect what's called the autocorrelation function to be the zero function — or rather the indicator function of lag zero: you're just going to have a spike of one at lag zero and then zeros. That's not the case here in the real data; it is the case here, and it looks like it is the case here, but not here. So what can we say? Because these data are not consistent with independent returns, the risk factor is not consistent with a random walk, or, if you like, the price process is not consistent with a geometric Brownian motion.

Other things we can look at: the extreme values and how they cluster. This is a graph, or an exploratory method, that I've always quite liked: draw a kind of high threshold and pick the largest observations, either the largest positive ones or the largest negative ones — I think here we've taken the largest down-movements, if I remember rightly, but made them positive — and look at the gaps between them. Now the theory — this really belongs, I think, in the extreme value theory chapter coming up tomorrow — says that if you have a process of independent data, then when you look at the extremes, so you draw a high line, maybe you take the top 10% or the top 5% of values, then in the limit, as you take a higher and higher threshold and just look at the largest observations, what you get is a Poisson process. And what we know about the Poisson process is that the waiting times between extremes are exponentially distributed. That's the reason we look at the extremes and look at the spacings between them.
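A sketch of this exploratory check, continuing with the daily log-return vector X from the previous sketch; the 95% threshold and the use of the largest down-moves are illustrative choices. The comparison against exponential quantiles at the end is the QQ-plot idea explained next:

```r
L.neg <- -X                                  # largest down-moves, made positive
u <- quantile(L.neg, probs = 0.95)           # a high threshold (top 5% of values)
exc.ind <- which(L.neg > u)                  # indices (days) of the threshold exceedances
gaps <- diff(exc.ind)                        # waiting times between extremes

## For iid data the exceedance times are approximately a Poisson process,
## so the gaps should look roughly exponential: compare against exponential quantiles
qqplot(qexp(ppoints(length(gaps))), sort(gaps),
       xlab = "Exponential quantiles", ylab = "Ordered gaps between extremes")
```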
Other things we can look at: we can look at the extreme values and how they cluster. This is a graph, or an exploratory method, that I have always quite liked: draw a high threshold and pick out the largest observations, either the largest positive ones or the largest negative ones. Here, if I remember rightly, we have taken the largest down-movements, made them positive, and looked at the gaps between them. The theory, which really belongs in the extreme value theory chapter coming up tomorrow, says that if you have a process of independent data and you look at the extremes, say the top 10% or the top 5% of values, then in the limit, as you take a higher and higher threshold and keep only the largest observations, what you get is a Poisson process. And what we know about a Poisson process is that the waiting times between events are exponentially distributed. That is the reason we look at the extremes, look at the spacings between them, and compare those spacings with an exponential distribution.

That introduces the subject of the QQ plot. The sequencing is a little bit out here, since the theory will come later, but let us say informally what a QQ plot is; there are probably people here who have not encountered it and many who have. In a QQ plot we compare data against a reference distribution: we compare the quantiles of the data against the quantiles of a reference distribution, and what we are looking for is a straight line. So what I do is take the spacings between extreme events, order them, and compare them against an exponential distribution. If the spacings were exponentially distributed I should see a straight line of points, which I do not see; but if I repeat the same experiment for independent data I do see this kind of straight-line pattern. It is never going to be perfect, but for iid data the process of extreme values converges to a Poisson process with exponential spacings, and for real data it does not. That is another stylized fact: the DAX data show both shorter and longer waiting times between extremes than the exponential would suggest.

Okay, so what about the distribution? Real log-return data, certainly daily, weekly and monthly log-returns, are not going to be normally distributed, and there are some nice scripts in the repository that confirm this empirically. Maybe by the time you get out to quarterly or annual log-returns you are getting more normal, because you are adding up more and more higher-frequency returns and there is a central limit effect, but then again we do not have a lot of data at those frequencies. For daily, weekly and monthly returns you can demolish the idea that they are normally distributed; they are not. This slide just lists a whole series of tests that could be carried out, and which indeed are carried out in one of the scripts in the repository; I will show you where it is, although we will not go through that one in particular. We have lots of tests of normality; we have some tests which are not specific to the normal, general tests of a data set against a candidate distribution, quite often constructed through the empirical distribution function that Marius essentially defined through a picture earlier; and then we have some tests which are specific to the normal distribution. One that I often default to is Jarque-Bera, because it is a test that jointly compares the skewness and kurtosis of the data with those of a normal distribution and refers the result to a chi-squared distribution. So another stylized fact is that financial return data are not normal.
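As an illustration of this kind of test, here is a hand-rolled Jarque-Bera-style statistic (packaged implementations exist, for instance in the tseries package, and do the same thing); X is again an assumed vector of log-returns:

## Jarque-Bera-type test: compares sample skewness and kurtosis with the normal
## values (0 and 3) and refers the statistic to a chi-squared distribution with 2 df
jb.test <- function(x) {
  n <- length(x)
  z <- x - mean(x)
  skew <- mean(z^3) / mean(z^2)^(3/2)   # sample skewness
  kurt <- mean(z^4) / mean(z^2)^2       # sample kurtosis (normal data give about 3)
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat, p.value = pchisq(stat, df = 2, lower.tail = FALSE))
}
jb.test(X)       # a tiny p-value is strong evidence against normality
shapiro.test(X)  # another standard test (requires a sample of at most 5000 observations)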
I mentioned QQ plots, and these belong to the realm of graphical tests. There are two varieties: PP plots, which are probability plots, and QQ plots, which are quantile plots. They are based on the order statistics of the data; that is, you take a data sample of size n and order it from smallest to largest, and these order statistics contain all the relevant information about the data. From them you can also get the empirical distribution function. The PP plot is basically a comparison of theoretical probabilities with empirical probabilities. The one we will most often use is the QQ plot, so let us just talk about that one: we compare theoretical quantiles of a reference distribution with empirical quantiles of the data. The order statistics can be thought of as the quantiles of the data, and they should correspond to the quantiles of the reference distribution if that distribution is a good fit. What is p_i here? p_i is the plotting point; it is essentially (i - 1/2)/n, or some small correction of that, because the i-th order statistic should correspond roughly to the (i - 1/2)/n quantile of the data.

Anyway, it is easier to look at the picture and learn how to interpret it. Here is Walt Disney: the quantiles of a normal reference distribution are plotted against the ordered sample, the order statistics, and what you typically see is an inverted S. The inverted S is the sign that you are dealing with data which tend to have bigger quantiles than the reference distribution. For location-scale families you can basically use the standard form of the distribution: you do not have to fit it to get the best estimates of mu and sigma, you can just plot the data against the quantiles of the standard normal.

Here is a little experiment with simulated data from a t3 distribution. You have red dots and black dots. The red dots are the QQ plot of the data against a normal reference distribution, and you see the inverted S in the red data, just as before. Now the true distribution of those data is t3, so the other quantile-quantile plot, the black dots, is a QQ plot against t3, and the theory says it should be linear. But you are never going to get perfection when you are dealing with a heavy-tailed distribution: the largest values can be quite erratic, so you have to learn a little bit what a good QQ plot looks like, particularly when the reference distribution is heavy-tailed. Most often we use the normal reference, or, in the case of the pictures you saw before, the exponential reference for the spacings. So that is the QQ plot.
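A small sketch of that experiment (the seed, the sample size and the colours are my choices; ppoints() supplies the plotting points p_i):

## Simulate t_3 data and compare QQ plots against a normal and against the true t_3
set.seed(271)
n <- 1000
Y <- rt(n, df = 3)
p <- ppoints(n)  # plotting points, essentially (i - 1/2)/n

plot(qnorm(p), sort(Y), col = "red",
     xlim = range(qnorm(p), qt(p, df = 3)),
     xlab = "Theoretical quantiles", ylab = "Ordered data",
     main = "t_3 data vs. normal (red) and t_3 (black) references")
points(qt(p, df = 3), sort(Y), col = "black")  # should be roughly linear
abline(0, 1, lty = 2)

## The built-in functions do the normal comparison directly
qqnorm(Y); qqline(Y)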
So let me summarize the one-dimensional stylized facts. Return series are not iid; or, since iid is really a concept for random variables, it is perhaps better to say that they cannot be modeled by iid random variables. Although it is true that they show little serial correlation, the series of absolute values, or indeed squared values, shows a lot of serial correlation. Your best guess about the expected log-return tomorrow is usually zero: it is very difficult to predict the next log-return with any great success, although you can have more success predicting the magnitude of the log-return, but not the sign, and that is a little bit the problem. I have not really talked much about the third stylized fact, but it belongs to the list: volatility appears to vary with time. Extremes appear in clusters. And return series are heavy-tailed, or at least heavier-tailed than the normal: longer tails than the normal, and also a little bit narrower in the center, which is a pattern called leptokurtic.

Now, in the set of scripts for Chapter 3 there are a few that you might want to play with. I said there was a script on testing normality: there is a very extensive script on testing univariate normality, largely written by Marius, which runs through this arsenal of normality tests and demonstrates that financial data do not appear to be normal, and there are also some Ljung-Box correlation tests. But just to recap what I have been saying, we will run through the script univariate_stylized_facts.R. We will do this with the S&P 500, the FTSE and the SMI, which we created before; I will just spring on ahead and create the data. So that is our data, and now we want to have a look at these stylized facts and check that they appear to hold for these series. You can immediately see the strong volatility of the S&P 500 index returns, or of the SMI index returns.

How do we get the ACF plot? There is a little function called acf which estimates the correlations, but it does a bit more: it actually gives you a matrix of pictures. The ones on the diagonal are the estimated serial correlations of each series, so the little bars are the estimates. For the S&P 500 the first estimate, at lag one, is a small negative number, as is the second, and then you hardly see anything; the same goes for the estimated serial correlations of the FTSE and the SMI. The off-diagonal pictures are what are called cross-correlation estimates: estimates of the correlation between, say, the S&P 500 on day t and the FTSE on day t+1, or the S&P 500 on day t and the SMI on day t+1. On the diagonal the first bar is always of height one, because the correlation of an observation with itself is one hundred percent; off the diagonal the first bar is basically an estimate of the correlation between returns on the same day, whereas the other bars are estimated correlations between days t and t+h. That is the picture you get, and hardly anything is happening in it; they are very dull pictures. But then you do the same thing for the absolute values and you see this: it immediately tells you that you have behavior that is nothing like a three-dimensional iid process; there is strong and persistent correlation, it appears, between the absolute values. Of course, you can also get this kind of thing with non-stationarity, which is an issue we will talk about after the coffee break.

What else? Let us do the weekly data, which I made before. Take the weekly returns and look at the correlation pictures: you see next to nothing. Look at the weekly absolute values and you see lots of correlation, so there is still estimated non-zero correlation between the absolute values of weekly returns; the lag here is measured in weeks, one week, two weeks, and so on. Somewhat surprising.

What about the heavy tails? There is a little block of code which cycles through the three time series and makes a QQ plot for each. There is a standard R function called qqnorm which makes a normal QQ plot, and qqline which puts a straight line on it, to allow us to judge whether we are dealing with plausibly normal data. I will just do it with the weekly data: are weekly log-returns normally distributed? Not really; you get the inverted S. We could also do some formal testing: weekly log-returns are not normal, and monthly log-returns are not normal either, or at least cannot be modeled by the normal.
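A sketch of those calls (here X3 stands for an n-by-3 matrix of daily log-returns with named columns; the data object is my assumption, whereas acf(), qqnorm() and qqline() are the standard R functions referred to above):

## Correlograms and cross-correlograms for a trivariate return series
acf(X3)        # matrix of (cross-)correlograms of the raw returns
acf(abs(X3))   # the same for the absolute values: strong, persistent correlation

## Normal QQ plots with a reference line, one series at a time
for (j in seq_len(ncol(X3))) {
  qqnorm(X3[, j], main = colnames(X3)[j])
  qqline(X3[, j])
}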
What about the clustered extreme values? There is a block of code that makes this picture for the S&P 500, the first index. We are looking for some kind of straight line here, and we get something that looks nothing like a straight line. So I repeated the experiment for iid data. In this block of code we simulate Student t distributed data; let me just draw your attention to rt, which gives you a random sample from a t distribution. The degrees of freedom is 5, and it makes a sample of the same length as the true data. Then we pick out the extremes and plot them; perhaps I should have done this in two pieces. So this is the independent data, those are the extremes we have picked out, the hundred most extreme values if you like, and then we look at the spacings between them and do the QQ plot, and you get something which is much more plausibly a straight line. Going back to the real data: those are the extremes of the real data, there is 2008 again, nothing happened in here, nothing happened in here, and those spacings are not exponential. I just picked a degrees of freedom which is relatively small but not very small; it does not really matter, because this has got nothing to do with the tails of the distribution and everything to do with the fact that I was simulating independent data. If I did it with a normal it would look the same.
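A rough sketch of that block (the helper function, the choice of 100 exceedances and the comparison with iid t_5 data are illustrative assumptions; X is again a vector of daily log-returns):

## Waiting times between the k largest losses, compared with the exponential
## distribution via a QQ plot; repeated for simulated iid data
spacings_qqplot <- function(x, k = 100, main = "") {
  loss <- -x                                         # largest down-moves become positive
  idx  <- sort(order(loss, decreasing = TRUE)[1:k])  # time points of the k largest losses
  gaps <- diff(idx)                                  # waiting times between those extremes
  plot(qexp(ppoints(length(gaps))), sort(gaps),
       xlab = "Exponential quantiles", ylab = "Ordered waiting times", main = main)
}

op <- par(mfrow = c(1, 2))
spacings_qqplot(X, main = "Real returns")                           # typically not linear
spacings_qqplot(rt(length(X), df = 5), main = "Simulated iid t_5")  # roughly linear
par(op)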
So there is a script there that tries to illustrate these univariate stylized facts. You will have seen that there are also some multivariate stylized facts in it, but I am not going to go through those at this point; I may come back to them tomorrow when we talk about multivariate time series. Just a few observations on the pictures we have seen: one of these is often called a correlogram, a picture of estimated correlations, and this one is called a cross-correlogram between two time series. Once again you see nothing much happening in these pictures, all the little bars are close to zero, and I will talk about exactly how this picture is formed later; but when you go to the absolute values you see that the correlations are not so negligible.

One final thing, if you give me one minute. Yes, a question about the red lines? They give a confidence interval for these correlation estimates, and the theory says that if you are dealing with iid data then 95 percent of those bars, so 19 out of 20, should lie between the red lines.

I just wanted to end with one picture, because it hints at the direction that Marius and Paul will be going tomorrow: back to BMW and Siemens. Here is the history. We looked at the DAX index before, and these are two return series from the DAX index. We have been showing this picture for about 15 years, it is an old chestnut if you like, because it seems like ancient history now, but you can trace some of these extreme values back to political events in Germany and in the Soviet Union of the time. If you do the pairwise scatterplot you see this, and you see that the three most extreme days are down here: day two is the fall of the Berlin Wall, or rather, since that took place over a period of time, a certain significant day at that time; day three is the coup; day one is Black Monday. You see them all down here in the joint left tail, and a question for us is going to be what kind of models can capture this tendency for the most extreme values to occur together, in a statistical way. Because really what we are trying to do is find a distribution that describes the occurrence of these extreme events, and this phenomenon of the most extreme events occurring together in the tail is what we call tail dependence. Quite often, when you look at several time series, you see that the most extreme events occur together; they occupy these positions in the lower-left tail, sometimes in the upper-right tail, and you get a very pointy joint tail. Because of that, we need models that can reproduce this sort of behavior if we are to model the dependence between extremes, and that is just a little hint at where the story goes tomorrow.

Okay, any burning questions? Sorry, Francois, do you have something to say? No? Well, you get a bit more of me after the break, so you can save your questions until then.
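For readers following along in R, a minimal sketch of the kind of pairwise scatterplot just described (the vectors x and y stand for two equal-length return series, for example BMW and Siemens, and the way the three most extreme joint days are picked out is my own simple choice, not necessarily how the original plot was produced):

## Pairwise scatterplot of two return series with the three most extreme joint
## down-days highlighted; x and y are assumed equal-length vectors of log-returns
plot(x, y, xlab = "Log-returns, series 1", ylab = "Log-returns, series 2",
     main = "Joint behaviour of two return series")
worst <- order(x + y)[1:3]  # crude proxy for the three most extreme joint down-days
points(x[worst], y[worst], col = "red", pch = 19)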
Info
Channel: QRM Tutorial
Views: 1,647
Rating: 4.8 out of 5
Id: iZihXwnRkd8
Length: 91min 6sec (5466 seconds)
Published: Sat Jan 20 2018