Chapter 4 - Gaussian Models

The MLE of the parameters of a multivariate Gaussian, given data $\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}$, is:
$$
\begin{aligned}
\hat{\boldsymbol{\mu}}_{mle} &= \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{i} \triangleq \overline{\mathbf{x}} \\
\hat{\boldsymbol{\Sigma}}_{mle} &= \frac{1}{N} \sum_{i=1}^{N}\left(\mathbf{x}_{i}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{i}-\overline{\mathbf{x}}\right)^{T}=\frac{1}{N}\left(\sum_{i=1}^{N} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right)-\overline{\mathbf{x}}\, \overline{\mathbf{x}}^{T}
\end{aligned}
$$
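Below is a minimal NumPy sketch of these MLE formulas; the function name `gaussian_mle` and the synthetic data are illustrative, not from the text.

```python
import numpy as np

def gaussian_mle(X):
    """MLE of a multivariate Gaussian from an (N, D) data matrix X."""
    N = X.shape[0]
    mu = X.mean(axis=0)           # sample mean, x-bar
    Xc = X - mu                   # centered data
    Sigma = (Xc.T @ Xc) / N       # MLE covariance (divides by N, not N - 1)
    return mu, Sigma

# Usage on synthetic data (illustrative values)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=1000)
mu_hat, Sigma_hat = gaussian_mle(X)
```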


Gaussian Discriminant Analysis


Quadratic Discriminant Analysis

From Bayes' rule,
$$
p(y=c | \mathbf{x}, \boldsymbol{\theta}) \propto p(y=c | \boldsymbol{\pi}) \times p\left(\mathbf{x} | \boldsymbol{\theta}_{c}\right)
$$

Assuming the prior to be $\operatorname{Cat}(\boldsymbol{\pi})$, so that $p(y=c | \boldsymbol{\pi})=\pi_{c}$, and Gaussian class conditionals,
$$
p(y=c | \mathbf{x}, \boldsymbol{\theta}) \propto \pi_{c}\left|2 \pi \boldsymbol{\Sigma}_{c}\right|^{-\frac{1}{2}} \exp \left[-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{c}\right)^{T} \boldsymbol{\Sigma}_{c}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{c}\right)\right]
$$
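As a sketch of how this posterior can be evaluated numerically (working in log space for stability), assuming the class parameters `pis`, `mus`, `Sigmas` have already been estimated; the function names are my own:

```python
import numpy as np

def qda_log_scores(x, pis, mus, Sigmas):
    """Unnormalized log posterior log pi_c + log N(x | mu_c, Sigma_c) for each class."""
    scores = []
    for pi, mu, Sigma in zip(pis, mus, Sigmas):
        diff = x - mu
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)   # log |2 pi Sigma_c|
        quad = diff @ np.linalg.solve(Sigma, diff)         # (x - mu)^T Sigma^{-1} (x - mu)
        scores.append(np.log(pi) - 0.5 * logdet - 0.5 * quad)
    return np.array(scores)

def qda_posterior(x, pis, mus, Sigmas):
    """Normalized class posteriors p(y=c | x, theta)."""
    s = qda_log_scores(x, pis, mus, Sigmas)
    s -= s.max()                                           # subtract max for stability
    p = np.exp(s)
    return p / p.sum()
```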

Linear Discriminant Analysis

Note that in the above case, each Gaussian is assumed to have its own mean $\boldsymbol{\mu}_{c}$ and covariance $\boldsymbol{\Sigma}_{c}$.

We can simplify this by assuming that the covariance matrices are shared (tied) across classes, i.e. $\boldsymbol{\Sigma}_{c}=\boldsymbol{\Sigma}$.

$$
\begin{aligned}
p(y=c | \mathbf{x}, \boldsymbol{\theta}) & \propto \pi_{c} \exp \left[\boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}-\frac{1}{2} \mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}-\frac{1}{2} \boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}\right] \\
&=\exp \left[\boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}-\frac{1}{2} \boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}+\log \pi_{c}\right] \exp \left[-\frac{1}{2} \mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}\right]
\end{aligned}
$$

Note that the second $\exp$ term (the quadratic term $\mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}$) is independent of the class $c$. Hence, in Bayes' rule it cancels out of the numerator and denominator.

Let's define $\boldsymbol{\beta}_{c}$ to be the coefficient of $\mathbf{x}$ and $\gamma_{c}$ to be the constant term:
$$
\begin{aligned}
\gamma_{c} &=-\frac{1}{2} \boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}+\log \pi_{c} \\
\boldsymbol{\beta}_{c} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}
\end{aligned}
$$

Then we can write the posterior as:
$$
p(y=c | \mathbf{x}, \boldsymbol{\theta})=\frac{e^{\boldsymbol{\beta}_{c}^{T} \mathbf{x}+\gamma_{c}}}{\sum_{c^{\prime}} e^{\boldsymbol{\beta}_{c^{\prime}}^{T} \mathbf{x}+\gamma_{c^{\prime}}}}
$$

This is exactly the softmax function, so the LDA posterior has the same form as a linear model followed by a softmax (a feed-forward network with no hidden layer and no nonlinear activation). It is called linear discriminant analysis because the quadratic term cancels out, leaving a linear decision boundary between the classes ($\boldsymbol{\beta}_{c}^{T} \mathbf{x}+\gamma_{c}$). A model with this posterior form is multi-class logistic regression, or multinomial logistic regression (also called the maximum entropy classifier in NLP).
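A small sketch of the LDA posterior computed via $\boldsymbol{\beta}_{c}$ and $\gamma_{c}$; the function name `lda_posterior` and parameter names are assumptions for illustration:

```python
import numpy as np

def lda_posterior(x, pis, mus, Sigma):
    """Class posteriors for LDA with a shared covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    betas = np.array([Sigma_inv @ mu for mu in mus])           # beta_c = Sigma^{-1} mu_c
    gammas = np.array([-0.5 * mu @ Sigma_inv @ mu + np.log(pi)
                       for pi, mu in zip(pis, mus)])           # gamma_c
    scores = betas @ x + gammas                                # beta_c^T x + gamma_c
    scores -= scores.max()                                     # numerically stable softmax
    e = np.exp(scores)
    return e / e.sum()
```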

The difference between binary logistic regression (BLR) and linear discriminant analysis with two classes (also known as Fisher's LDA) is discussed here:
Logistic regression vs. LDA as two-class classifiers - Cross Validated

2-class LDA tries to maximize the distance between the means of the two categories while simultaneously minimizing the scatter (or standard deviation) within each category (watch the YouTube video explanation); a code sketch follows the figure below. Note that, in contrast, PCA tries to find the directions that account for the maximum variation in the data (without having access to the class labels).

2-class LDA procedure:

lda-2-class.png
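A minimal sketch of the 2-class (Fisher) LDA direction, which maximizes between-class separation relative to within-class scatter; the function name and variable names are illustrative:

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Fisher discriminant direction for two classes, given (N0, D) and (N1, D) data."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - mu0).T @ (X0 - mu0)            # within-class scatter, class 0
    S1 = (X1 - mu1).T @ (X1 - mu1)            # within-class scatter, class 1
    Sw = S0 + S1                              # total within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)        # w is proportional to Sw^{-1} (mu1 - mu0)
    return w / np.linalg.norm(w)              # unit-norm projection direction
```

Projecting the data onto this direction gives the 1-D representation that best separates the two classes in the above sense.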


Inference in jointly Gaussian distributions

Given a joint distribution $p\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)$, it is useful to be able to compute marginals $p\left(\mathbf{x}_{1}\right)$ and conditionals $p\left(\mathbf{x}_{1} | \mathbf{x}_{2}\right)$.

Suppose the joint is Gaussian, $p\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)=\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$, with parameters partitioned as
$$
\boldsymbol{\mu}=\left(\begin{array}{c}{\boldsymbol{\mu}_{1}} \\ {\boldsymbol{\mu}_{2}}\end{array}\right), \quad \boldsymbol{\Sigma}=\left(\begin{array}{cc}{\boldsymbol{\Sigma}_{11}} & {\boldsymbol{\Sigma}_{12}} \\ {\boldsymbol{\Sigma}_{21}} & {\boldsymbol{\Sigma}_{22}}\end{array}\right)
$$

Then the marginals are given by:

$$
\begin{aligned}
p\left(\mathbf{x}_{1}\right) &=\mathcal{N}\left(\mathbf{x}_{1} | \boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{11}\right) \\
p\left(\mathbf{x}_{2}\right) &=\mathcal{N}\left(\mathbf{x}_{2} | \boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{22}\right)
\end{aligned}
$$

i.e., we simply read off the mean and covariance block of each random variable separately.

and the posterior conditional is given by:
$$
p\left(\mathbf{x}_{1} | \mathbf{x}_{2}\right)=\mathcal{N}\left(\mathbf{x}_{1} | \boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2}\right)
$$
where $\boldsymbol{\Lambda}=\boldsymbol{\Sigma}^{-1}$ is the precision matrix, partitioned the same way as $\boldsymbol{\Sigma}$, and
$$
\begin{aligned}
\boldsymbol{\mu}_{1|2} &=\boldsymbol{\mu}_{1}+\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}\left(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}\right) \\
&=\boldsymbol{\mu}_{1}-\boldsymbol{\Lambda}_{11}^{-1} \boldsymbol{\Lambda}_{12}\left(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}\right) \\
&=\boldsymbol{\Sigma}_{1|2}\left(\boldsymbol{\Lambda}_{11} \boldsymbol{\mu}_{1}-\boldsymbol{\Lambda}_{12}\left(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}\right)\right)
\end{aligned}
$$
$$
\boldsymbol{\Sigma}_{1|2}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}=\boldsymbol{\Lambda}_{11}^{-1}
$$

Observe that the conditional mean is just a linear function of $\mathbf{x}_{2}$, while the conditional covariance is a constant matrix that is independent of $\mathbf{x}_{2}$.

Note that the marginals and conditionals are also Gaussians.
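The marginalization and conditioning formulas above translate directly into NumPy; this is a minimal sketch with assumed helper and variable names:

```python
import numpy as np

def gaussian_condition(mu, Sigma, idx1, idx2, x2):
    """Compute p(x1 | x2) for a jointly Gaussian vector.

    mu, Sigma: joint mean and covariance; idx1, idx2: index arrays for the
    x1 and x2 blocks; x2: observed value of the x2 block.
    The marginal of x1 is simply (mu[idx1], Sigma[np.ix_(idx1, idx1)]).
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)              # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + K @ (x2 - mu2)            # conditional mean (linear in x2)
    Sigma_cond = S11 - K @ S21                # conditional covariance (independent of x2)
    return mu_cond, Sigma_cond
```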


Linear Gaussian Systems

Suppose we have two variables, $\mathbf{x}$ and $\mathbf{y}$.
Let $\mathbf{x} \in \mathbb{R}^{D_{x}}$ be a hidden (unobservable) variable, and $\mathbf{y} \in \mathbb{R}^{D_{y}}$ be a noisy observation of $\mathbf{x}$.

Let us assume we have the following prior and likelihood:

$$
\begin{aligned}
p(\mathbf{x}) &=\mathcal{N}\left(\mathbf{x} | \boldsymbol{\mu}_{x}, \boldsymbol{\Sigma}_{x}\right) \\
p(\mathbf{y} | \mathbf{x}) &=\mathcal{N}\left(\mathbf{y} | \mathbf{A} \mathbf{x}+\mathbf{b}, \boldsymbol{\Sigma}_{y}\right)
\end{aligned}
$$

where $\mathbf{A}$ is a matrix of size $D_{y} \times D_{x}$.
The task we are interested in is inferring $\mathbf{x}$ given $\mathbf{y}$.

One can show that the posterior $p(\mathbf{x} | \mathbf{y})$ is given by:
$$
\begin{aligned}
p(\mathbf{x} | \mathbf{y}) &=\mathcal{N}\left(\mathbf{x} | \boldsymbol{\mu}_{x|y}, \boldsymbol{\Sigma}_{x|y}\right) \\
\boldsymbol{\Sigma}_{x|y}^{-1} &=\boldsymbol{\Sigma}_{x}^{-1}+\mathbf{A}^{T} \boldsymbol{\Sigma}_{y}^{-1} \mathbf{A} \\
\boldsymbol{\mu}_{x|y} &=\boldsymbol{\Sigma}_{x|y}\left[\mathbf{A}^{T} \boldsymbol{\Sigma}_{y}^{-1}(\mathbf{y}-\mathbf{b})+\boldsymbol{\Sigma}_{x}^{-1} \boldsymbol{\mu}_{x}\right]
\end{aligned}
$$

The above result is useful when one needs to infer an unknown scalar/vector from noisy measurements.
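A sketch of this posterior update (function and variable names are my own); the example infers a scalar from two noisy measurements of it:

```python
import numpy as np

def linear_gaussian_posterior(y, A, b, mu_x, Sigma_x, Sigma_y):
    """Posterior p(x | y) for the linear Gaussian system above."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sy_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(Sx_inv + A.T @ Sy_inv @ A)            # Sigma_{x|y}
    mu_post = Sigma_post @ (A.T @ Sy_inv @ (y - b) + Sx_inv @ mu_x)  # mu_{x|y}
    return mu_post, Sigma_post

# Example: infer a scalar x from two noisy observations y of it
A = np.array([[1.0], [1.0]])
y = np.array([2.9, 3.1])
mu_post, Sigma_post = linear_gaussian_posterior(
    y, A, b=np.zeros(2),
    mu_x=np.array([0.0]), Sigma_x=np.array([[10.0]]), Sigma_y=np.eye(2))
```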