Multivariate Normal distribution and Cholesky decomposition in Stan

Simulated data in a Multivariate Normal distribution
Figure 1: Simulated data in a Multivariate Normal distribution

This post provides an example of simulating data in a Multivariate Normal distribution with given parameters, and estimating the parameters based on the simulated data via Cholesky decomposition in stan. Multivariate Normal distribution is a commonly used distribution in various regression models and machine learning tasks. It generalizes the Normal distribution into multidimensional space. Its PDF can be expressed as:

PDF of a Multivariate Normal distribution

where mu is a vector of k elements for the location parameters and Sigma is the Variance-Covariance matrix. The |Sigma| calculates the absolute determinant of Sigma in the equation above.

Cholesky decomposition is used to decompose a positive-definite matrix into the product of a lower triangular matrix and its transpose. In our case, we will decompose the correlation matrix (R) into the product of two triangular matrices (L*L’). Note: even though we can directly perform Cholesky decomposition on the Variance-Covariance matrix (Sigma), it is not recommended, since we may fail to estimate the standard deviations (sigma) of each variable.

Matrix representation of Cholesky decomposition

1. Data simulation

The variance-covariance matrix (Sigma) for data simulation and its analytical solution via Cholesky decomposition are provided below. For example, the standard deviation for the first variable sd_1 is 0.5, and the correlation coefficient (rho) between first and second variables is 0.7 (see Figure 1 for more intuitive illustration).

Decomposing an example VCV matrix via Cholesky factorization

To generate the data, we are using the mvrnorm function from the package MASS by specifying a vector of means (mu) and variance-covariance matrix (Sigma).

mu = c(1, 2, -5) # means
R = matrix(c(1, 0.7, 0.2, # correlation matrix (R)
0.7, 1, -0.5,
0.2, -0.5, 1), 3)
sigmas = c(0.5, 1.2, 2.3) # sd1=0.5, sd2=1.2, sd3=2.3
Sigma = diag(sigmas) %*% R %*% diag(sigmas) # VCV matrix
data = mvrnorm(1000, mu = mu, Sigma = Sigma) # data simulation

2. Parameter estimation in stan

With the pseudodata generated above, we can try to recover the parameters in the multivariate normal distribution. Here we are using a probabilistic programming language, stan, to estimate the parameters of the multivariate normal distribution based on the previously simulated data. The stancode block is attached below, and you can directly embed it in R.

With the compiled stan model, we can start the sampling process and summarize the estimated parameters.

input = list(N = 1000, K = 3, y = convertRowsToList(data))
results = sampling(stan_MVnorm_model, data=input,
chains = 4, iter = 2000, cores = 4)
Figure 2: Traceplots for the estimated parameters in the Multivariate Normal model

Useful link:

Programming, Data analysis & Deep learning!

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store