# Multivariate Normal distribution and Cholesky decomposition in Stan

This post provides an example of simulating data in a **Multivariate Normal distribution** with given parameters, and estimating the parameters based on the simulated data via **Cholesky decomposition** in `stan`

. **Multivariate Normal distribution** is a commonly used distribution in various regression models and machine learning tasks. It generalizes the Normal distribution into multidimensional space. Its PDF can be expressed as:

where ** mu** is a vector of

**elements for the location parameters and**

*k***is the Variance-Covariance matrix. The**

*Sigma***calculates the absolute determinant of**

*|Sigma|***in the equation above.**

*Sigma***Cholesky decomposition** is used to decompose a positive-definite matrix into the product of a lower triangular matrix and its transpose. In our case, we will decompose the correlation matrix (** R**) into the product of two triangular matrices (

*L****

**).**

*L’***even though we can directly perform Cholesky decomposition on the Variance-Covariance matrix (**

*Note:***), it is not recommended, since we may fail to estimate the standard deviations (**

*Sigma***sigma**) of each variable.

**1. Data simulation**

The variance-covariance matrix (** Sigma**) for data simulation and its analytical solution via Cholesky decomposition are provided below. For example, the standard deviation for the first variable

**is 0.5, and the correlation coefficient (**

*sd_1***) between first and second variables is 0.7 (see Figure 1 for more intuitive illustration).**

*rho*To generate the data, we are using the *mvrnorm* function from the package **MASS **by specifying a vector of means (** mu**) and variance-covariance matrix (

**).**

*Sigma*`mu = c(1, 2, -5) # means`

R = matrix(c(1, 0.7, 0.2, # correlation matrix (R)

0.7, 1, -0.5,

0.2, -0.5, 1), 3)

sigmas = c(0.5, 1.2, 2.3) # sd1=0.5, sd2=1.2, sd3=2.3

Sigma = diag(sigmas) %*% R %*% diag(sigmas) # VCV matrix

data = mvrnorm(1000, mu = mu, Sigma = Sigma) # data simulation

## 2. Parameter estimation in stan

With the pseudodata generated above, we can try to recover the parameters in the multivariate normal distribution. Here we are using a probabilistic programming language, `stan`

, to estimate the parameters of the multivariate normal distribution based on the previously simulated data. The `stan`

code block is attached below, and you can directly embed it in R.

With the compiled **stan** model, we can start the sampling process and summarize the estimated parameters.

`input = list(N = 1000, K = 3, y = convertRowsToList(data))`

results = sampling(stan_MVnorm_model, data=input,

chains = 4, iter = 2000, cores = 4)

Useful link: