Modeling Innovations

Contents: Problems; Kernel Density Estimation; Missing Data

Problems. The innovations are not normally distributed. Even when they are independent and identically distributed, we cannot guarantee normality. What is more, different series of innovations have different lengths. For example, the current version of the simulator has five series of innovations. They are available in the Excel file innovations.xlsx in the GitHub repository asarantsev/simulator-current:

  • Autoregression for log volatility: 96 data points
  • S&P stock returns: 97 data points
  • International stock returns: 55 data points
  • Autoregression for log rate: 97 data points
  • Corporate bond returns: 52 data points

In the previous version of the simulator, I simply used the multivariate normal distribution. We can compute the empirical covariance matrix by ignoring the missing data, and the means are, of course, zero. But, as mentioned above, the distribution is in fact not normal. To be more precise, the first and fourth components are not normal: their kurtosis is greater than that of the normal distribution, and their skewness is nonzero.
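As a minimal sketch of computing the empirical covariance while ignoring missing data (with a toy stand-in for the actual innovations table, since the real data lives in innovations.xlsx), one can use pandas, whose covariance is computed pairwise over complete observations:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the innovations table: two full series and one short one.
df = pd.DataFrame({
    'a': [0.1, -0.2, 0.3, -0.1, 0.05],
    'b': [0.2, 0.1, -0.3, 0.0, -0.1],
    'c': [np.nan, np.nan, 0.1, -0.2, 0.1],  # missing early values
})

# pandas computes the covariance pairwise: for each pair of columns it uses
# only the rows where both values are present, so missing data is ignored.
cov = df.cov()

# Sample excess kurtosis and skewness per column (NaNs skipped); both are
# zero for a normal distribution, which is the normality check described here.
print(cov)
print(df.kurtosis())
print(df.skew())
```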

I tried fitting other distributions to each of these two components: skew-normal and variance-gamma. But I failed to fit them well. What is more, fitting a univariate distribution is not enough: I would need to combine these fits with normal marginals for the other three components. I did not find any directly relevant literature; this would require developing a whole new theory of multivariate distributions.

Kernel Density Estimation. This is a universal nonparametric method, abbreviated KDE. We apply the Gaussian kernel: for  X_1, \ldots, X_N \in \mathbb R^d, the estimated density is

 f(x) = \frac1N\sum\limits_{k=1}^N\varphi(x; X_k, \Sigma)

where  \varphi(x; \mu, \Sigma) is the probability density function on  \mathbb R^d of  \mathcal N_d(\mu, \Sigma), the  d-dimensional Gaussian distribution with mean vector  \mu and covariance matrix  \Sigma. In other words, to simulate a random variable  Y with this density, we pick  U uniformly at random from  X_1, \ldots, X_N and simulate additional noise  Z \sim \mathcal N_d(0, \Sigma) independent of  U. Then  Y = U + Z.

For  \Sigma we apply Silverman’s rule of thumb:  \Sigma is a diagonal matrix with

 \Sigma_{ii}  = \left(\frac{4}{d+2}\right)^{1/(d+4)}\cdot N^{-1/(d+4)}\cdot \min(\sigma_i, \rho_i/1.34)

Here,  \sigma_i and  \rho_i are statistics of the ith component of the data  X_1, \ldots, X_N: namely,  \sigma_i is the empirical standard deviation, and  \rho_i is the empirical interquartile range, the 75% quantile minus the 25% quantile. This is implemented in the file simKDE.py from our main repository. We stress that this code simulates innovations; it does not compute the joint density function.
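A minimal sketch of such a simulator, assuming the Silverman factors above give the per-component standard deviations of the Gaussian kernel (the actual implementation is in simKDE.py; the function name here is hypothetical):

```python
import numpy as np

def sample_kde(data, n_sim, rng=None):
    """Simulate n_sim draws from a Gaussian KDE fitted to data of shape (N, d).

    Per-component bandwidths follow Silverman's rule of thumb as in the text:
    h_i = (4/(d+2))**(1/(d+4)) * N**(-1/(d+4)) * min(sd_i, iqr_i/1.34),
    used here as the standard deviation of the Gaussian kernel (an assumption).
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data, float)
    N, d = data.shape
    sd = data.std(axis=0, ddof=1)
    q75, q25 = np.percentile(data, [75, 25], axis=0)
    h = ((4 / (d + 2)) ** (1 / (d + 4)) * N ** (-1 / (d + 4))
         * np.minimum(sd, (q75 - q25) / 1.34))
    # Pick data points uniformly with replacement, then add Gaussian noise:
    # this is exactly the Y = U + Z recipe described above.
    picks = data[rng.integers(0, N, size=n_sim)]
    noise = rng.normal(0.0, h, size=(n_sim, d))
    return picks + noise
```

For example, `sample_kde(innovations_array, 10_000)` would return a `(10000, d)` array of simulated innovations.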

Missing Data. The other problem mentioned above is the lack of data for some series of innovations. We considered many imputation methods, for example the iterative imputer based on Principal Component Analysis, or the k-nearest neighbors method. These are implemented in the Python package sklearn. But such methods reduce variance, because imputed data reverts to the mean. I chose a custom-designed approach instead. It proceeds in iterated steps; we describe one step below.

Step. Assume we have d + 1 series of  N independent identically distributed data points. Out of these,  d series have all  N values, and the last one is missing  M values. We regress this last series (using, of course, only its existing  N - M values) on the  d full series (using only the matching  N - M data points), with ordinary least squares linear regression. Then we take the residuals of this regression and randomly choose  M of them with replacement. For each of the  M missing points, we take the value predicted by this regression from the  d full series, and add one of the randomly chosen residuals. This fills the missing data, and completes the description of the step.
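One step of this procedure can be sketched as follows (a hedged illustration, not the actual innovations.py; the function name is hypothetical):

```python
import numpy as np

def fill_one_series(full, partial, rng=None):
    """One step of the regression-based imputation described above.

    full:    array of shape (N, d), the d complete series (the backbone).
    partial: array of shape (N,), with np.nan at the M missing positions.
    Returns a copy of `partial` with the missing values filled.
    """
    rng = np.random.default_rng(rng)
    full = np.asarray(full, float)
    y = np.asarray(partial, float).copy()
    miss = np.isnan(y)
    # OLS of the observed part of the series on the matching rows of the
    # backbone, with an intercept column.
    X = np.column_stack([np.ones(len(y)), full])
    beta, *_ = np.linalg.lstsq(X[~miss], y[~miss], rcond=None)
    residuals = y[~miss] - X[~miss] @ beta
    # Fitted value at each missing point plus a residual resampled with
    # replacement, which preserves the residual variance (unlike mean-
    # reverting imputers).
    boot = rng.choice(residuals, size=int(miss.sum()), replace=True)
    y[miss] = X[miss] @ beta + boot
    return y
```

Each filled series can then join the backbone for the next step, which matches the iterated application described below.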

We first apply this step to the one missing data point of series 1 (using series 2 and 4 as the backbone), then to the  97 - 55 = 42 missing data points of series 3 (using series 1, 2, 4 as the backbone), and then to the  97 - 52 = 45 missing data points of series 5. We write the new data frame into a separate Excel file called filled.xlsx, available in the same GitHub repository. The Python code is given in innovations.py in the same repository.
