Contents: Problems; Kernel Density Estimation; Missing Data
Problems. The first problem is the non-normality of innovations. Even when they are independent and identically distributed, we cannot guarantee their normality. What is more, different series of innovations have different lengths. For example, in our current version of the simulator, we have five series of innovations. They are available in the Excel file innovations.xlsx in the GitHub repository asarantsev/simulator-current:
- Autoregression for log volatility: 96 data points
- S&P stock returns: 97 data points
- International stock returns: 55 data points
- Autoregression for log rate: 97 data points
- Corporate bond returns: 52 data points
In the previous version of the simulator, I simply used the multivariate normal distribution. We can compute the empirical covariance matrix by ignoring the missing data. And the means are, of course, zero. But, as mentioned above, the distribution is in fact not normal. To be more precise, the first and fourth components are not normal: their kurtosis is greater than that of the normal distribution, and their skewness is nonzero.
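As a quick illustration, here is a minimal sketch of these diagnostics (not part of the repository); it assumes innovations.xlsx has one column per series of innovations.

```python
# Diagnostic sketch: pairwise covariance, plus skewness and excess kurtosis per series.
# Assumption: innovations.xlsx has one column per series of innovations.
import pandas as pd
from scipy import stats

df = pd.read_excel('innovations.xlsx')

# pandas computes the empirical covariance pairwise, ignoring missing values.
print(df.cov())

# Skewness and excess kurtosis of each series, after dropping missing values;
# for a normal distribution both are zero.
for col in df.columns:
    series = df[col].dropna()
    print(col, 'skewness:', stats.skew(series), 'excess kurtosis:', stats.kurtosis(series))
```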
I tried to fit other distributions to each of these two components: skew-normal and variance-gamma. But I failed to fit them well. What is more, fitting a univariate distribution is not enough: I would need to combine it with normal marginals for the other three components. I did not find literature on exactly this setting. This would require developing an entire new theory of multivariate distributions.
Kernel Density Estimation. This is a universal nonparametric method: kernel density estimation (KDE). We apply a Gaussian kernel: for data points $x_1, \ldots, x_N$ in $\mathbb{R}^d$, we have the density

$$\hat{p}(x) = \frac{1}{N}\sum_{i=1}^{N}\phi(x;\, x_i, H), \quad x \in \mathbb{R}^d,$$

where $\phi(\cdot;\, \mu, \Sigma)$ is the probability density function on $\mathbb{R}^d$ for $\mathcal{N}_d(\mu, \Sigma)$, which is the $d$-dimensional Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$. In other words, to simulate a random variable $X$ with such density, we pick $x_I$ at random (uniformly) from $x_1, \ldots, x_N$ and simulate additional noise $Z \sim \mathcal{N}_d(0, H)$ independent of $I$. Thus $X = x_I + Z$.

For the bandwidth matrix $H$ we apply Silverman's rule of thumb: this is a diagonal matrix with entries $H_{kk} = h_k^2$, where

$$h_k = 0.9\,\min\!\left(\hat{\sigma}_k,\ \frac{\mathrm{IQR}_k}{1.34}\right) N^{-1/5}.$$

Here, $\hat{\sigma}_k$ and $\mathrm{IQR}_k$ are statistics of the $k$th component of the data $x_1, \ldots, x_N$. Namely, $\hat{\sigma}_k$ is the empirical standard deviation, and $\mathrm{IQR}_k$ is the empirical interquartile range: the 75% quantile minus the 25% quantile. This is implemented in the file simKDE.py from our main repository. We stress that this code simulates innovations; it does not compute the joint density function.
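For illustration, here is a minimal sketch of this simulation scheme; the actual code is in simKDE.py, and the function names below are mine, with the data assumed to be a complete NumPy array with one row per observation.

```python
# Sketch: simulate from the Gaussian KDE with per-component Silverman bandwidths.
import numpy as np

def silverman_bandwidths(data):
    """h_k = 0.9 * min(sigma_k, IQR_k / 1.34) * N^(-1/5) for each component k."""
    N = data.shape[0]
    sigma = data.std(axis=0, ddof=1)
    iqr = np.quantile(data, 0.75, axis=0) - np.quantile(data, 0.25, axis=0)
    return 0.9 * np.minimum(sigma, iqr / 1.34) * N ** (-0.2)

def simulate_kde(data, n_sims, rng=None):
    """Pick data points uniformly at random and add independent Gaussian noise
    with diagonal covariance H = diag(h_1^2, ..., h_d^2)."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = data.shape
    h = silverman_bandwidths(data)
    picks = data[rng.integers(0, N, size=n_sims)]
    noise = rng.normal(size=(n_sims, d)) * h
    return picks + noise
```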
Missing Data. The other problem mentioned above is the lack of data for some series of innovations. We considered many imputation methods, for example the iterative imputer using Principal Component Analysis, or the k-nearest neighbors method. They are implemented in the Python package sklearn. But such methods reduce variance, because imputed data reverts toward the mean. I chose a custom-designed approach. It proceeds in iterated steps. We describe one step below.
Step. Assume we have $m$ series of $N$ independent identically distributed data points. Out of these, the first $m-1$ have all $N$ values, and the last one has $N-n$ missing values. We then regress this last series (of course, only its $n$ existing values) versus the full $m-1$ series (of course, only the $n$ matching data points). We use ordinary least squares linear regression. Then we take the residuals of this regression. We randomly choose, with replacement, $N-n$ of these residuals. And for each of the $N-n$ missing points, we take the value predicted by this regression from the first $m-1$ data series and add the corresponding randomly chosen residual. This is how we fill the missing data. This completes the description of this step.
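Here is a minimal sketch of one such step, using statsmodels for the ordinary least squares fit; the actual implementation is in innovations.py and may differ in details, and the function name below is mine.

```python
# One imputation step: OLS of the incomplete series on the complete ones,
# then fill each missing value with its prediction plus a bootstrapped residual.
import numpy as np
import statsmodels.api as sm

def fill_one_series(full, partial, rng=None):
    """full: (N, k) array of complete series; partial: length-N array with NaNs."""
    rng = np.random.default_rng() if rng is None else rng
    missing = np.isnan(partial)
    X = sm.add_constant(full)                     # intercept plus the complete series
    model = sm.OLS(partial[~missing], X[~missing]).fit()
    predicted = model.predict(X[missing])         # regression predictions at missing rows
    boot = rng.choice(model.resid, size=missing.sum(), replace=True)
    filled = partial.copy()
    filled[missing] = predicted + boot
    return filled
```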
We first apply this step to the one missing data point for series 1 (using series 2 and 4 as the backbone), then to the 42 missing data points for series 3 (using series 1, 2, 4 as the backbone), and then to the 45 missing data points for series 5. We write this new data frame into a separate Excel file called filled.xlsx. It is available in the same GitHub repository. The Python code is given in innovations.py in the same repository.
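For concreteness, the three passes could be chained as in the hypothetical sketch below, reusing fill_one_series from the previous sketch; the column order, and the use of all four other series as the backbone for series 5, are assumptions not fixed by the text above.

```python
# Hypothetical chaining of the three filling passes; column order is assumed.
import pandas as pd

df = pd.read_excel('innovations.xlsx')
c = list(df.columns)  # assumed order: series 1, 2, 3, 4, 5

# Series 1 (one missing point), regressed on series 2 and 4.
df[c[0]] = fill_one_series(df[[c[1], c[3]]].to_numpy(), df[c[0]].to_numpy())
# Series 3, regressed on the now-complete series 1, 2, 4.
df[c[2]] = fill_one_series(df[[c[0], c[1], c[3]]].to_numpy(), df[c[2]].to_numpy())
# Series 5, regressed on the other four series (this backbone is an assumption).
df[c[4]] = fill_one_series(df[[c[0], c[1], c[2], c[3]]].to_numpy(), df[c[4]].to_numpy())

df.to_excel('filled.xlsx', index=False)
```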