Chapter 4 - Gaussian Models

The MLE of the parameters of a multivariate Gaussian, given data $\mathbf{x}_{1}, \ldots, \mathbf{x}_{N}$, is:
$$
\begin{aligned}
\hat{\boldsymbol{\mu}}_{mle} &= \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_{i} \triangleq \overline{\mathbf{x}} \\
\hat{\boldsymbol{\Sigma}}_{mle} &= \frac{1}{N} \sum_{i=1}^{N}\left(\mathbf{x}_{i}-\overline{\mathbf{x}}\right)\left(\mathbf{x}_{i}-\overline{\mathbf{x}}\right)^{T}=\frac{1}{N}\left(\sum_{i=1}^{N} \mathbf{x}_{i} \mathbf{x}_{i}^{T}\right)-\overline{\mathbf{x}}\, \overline{\mathbf{x}}^{T}
\end{aligned}
$$
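Below is a minimal NumPy sketch of these MLE formulas; the function name `gaussian_mle` and the synthetic data are illustrative, not from the text.

```python
import numpy as np

def gaussian_mle(X):
    """MLE of a multivariate Gaussian from an (N, D) data matrix X."""
    N = X.shape[0]
    mu = X.mean(axis=0)           # sample mean, x-bar
    Xc = X - mu                   # centered data
    Sigma = (Xc.T @ Xc) / N       # MLE covariance (divides by N, not N - 1)
    return mu, Sigma

# Usage on synthetic data (illustrative values)
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=1000)
mu_hat, Sigma_hat = gaussian_mle(X)
```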


Gaussian Discriminant Analysis


Quadratic Discriminant Analysis

From Bayes' rule,
$$
p(y=c | \mathbf{x}, \boldsymbol{\theta}) \propto p(y=c | \boldsymbol{\pi}) \times p\left(\mathbf{x} | \boldsymbol{\theta}_{c}\right)
$$

Assuming the prior to be $\operatorname{Cat}(\boldsymbol{\pi})$, so that $p(y=c | \boldsymbol{\pi})=\pi_{c}$, and Gaussian class conditionals,
$$
p(y=c | \mathbf{x}, \boldsymbol{\theta}) \propto \pi_{c}\left|2 \pi \boldsymbol{\Sigma}_{c}\right|^{-\frac{1}{2}} \exp \left[-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_{c}\right)^{T} \boldsymbol{\Sigma}_{c}^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_{c}\right)\right]
$$
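As a sketch of how this posterior can be evaluated numerically (working in log space for stability), assuming the class parameters `pis`, `mus`, `Sigmas` have already been estimated; the function names are my own:

```python
import numpy as np

def qda_log_scores(x, pis, mus, Sigmas):
    """Unnormalized log posterior log pi_c + log N(x | mu_c, Sigma_c) for each class."""
    scores = []
    for pi, mu, Sigma in zip(pis, mus, Sigmas):
        diff = x - mu
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma)   # log |2 pi Sigma_c|
        quad = diff @ np.linalg.solve(Sigma, diff)         # (x - mu)^T Sigma^{-1} (x - mu)
        scores.append(np.log(pi) - 0.5 * logdet - 0.5 * quad)
    return np.array(scores)

def qda_posterior(x, pis, mus, Sigmas):
    """Normalized class posteriors p(y=c | x, theta)."""
    s = qda_log_scores(x, pis, mus, Sigmas)
    s -= s.max()                                           # subtract max for stability
    p = np.exp(s)
    return p / p.sum()
```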

Linear Discriminant Analysis

Note that in the above case, each Gaussian is assumed to have its own mean $\boldsymbol{\mu}_{c}$ and covariance $\boldsymbol{\Sigma}_{c}$.

We can simplify this by assuming that the covariance matrices are shared (tied) across classes, i.e. $\boldsymbol{\Sigma}_{c}=\boldsymbol{\Sigma}$.

$$
\begin{aligned}
p(y=c | \mathbf{x}, \boldsymbol{\theta}) & \propto \pi_{c} \exp \left[\boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}-\frac{1}{2} \mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}-\frac{1}{2} \boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}\right] \\
&=\exp \left[\boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}-\frac{1}{2} \boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}+\log \pi_{c}\right] \exp \left[-\frac{1}{2} \mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}\right]
\end{aligned}
$$

Note that the second $\exp$ term (the quadratic term $\mathbf{x}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{x}$) is independent of the class $c$. Hence, in Bayes' rule it cancels out of the numerator and denominator.

Let's define $\boldsymbol{\beta}_{c}$ to be the coefficient of $\mathbf{x}$ and $\gamma_{c}$ to be the constant term:
$$
\begin{aligned}
\gamma_{c} &=-\frac{1}{2} \boldsymbol{\mu}_{c}^{T} \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}+\log \pi_{c} \\
\boldsymbol{\beta}_{c} &=\boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_{c}
\end{aligned}
$$

Then we can write the posterior as:
$$
p(y=c | \mathbf{x}, \boldsymbol{\theta})=\frac{e^{\boldsymbol{\beta}_{c}^{T} \mathbf{x}+\gamma_{c}}}{\sum_{c^{\prime}} e^{\boldsymbol{\beta}_{c^{\prime}}^{T} \mathbf{x}+\gamma_{c^{\prime}}}}
$$

This is exactly the softmax function, so the LDA posterior has the same form as a linear model followed by a softmax (a feed-forward network with no hidden layer and no nonlinear activation). It is called linear discriminant analysis because the quadratic term cancels out, leaving a linear decision boundary between the classes ($\boldsymbol{\beta}_{c}^{T} \mathbf{x}+\gamma_{c}$). A model with this posterior form is multi-class logistic regression, or multinomial logistic regression (also called the maximum entropy classifier in NLP).
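A small sketch of the LDA posterior computed via $\boldsymbol{\beta}_{c}$ and $\gamma_{c}$; the function name `lda_posterior` and parameter names are assumptions for illustration:

```python
import numpy as np

def lda_posterior(x, pis, mus, Sigma):
    """Class posteriors for LDA with a shared covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    betas = np.array([Sigma_inv @ mu for mu in mus])           # beta_c = Sigma^{-1} mu_c
    gammas = np.array([-0.5 * mu @ Sigma_inv @ mu + np.log(pi)
                       for pi, mu in zip(pis, mus)])           # gamma_c
    scores = betas @ x + gammas                                # beta_c^T x + gamma_c
    scores -= scores.max()                                     # numerically stable softmax
    e = np.exp(scores)
    return e / e.sum()
```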

The difference between binary logistic regression (BLR) and linear discriminant analysis with two classes (also known as Fisher's LDA) is discussed here:
Logistic regression vs. LDA as two-class classifiers - Cross Validated

2-class LDA tries to maximize the distance between the means of the two categories while simultaneously minimizing the scatter (or standard deviation) within each category (watch the YouTube video explanation); a code sketch follows the figure below. Note that, in contrast, PCA tries to find the directions that account for the maximum variation in the data (without having access to the class labels).

2-class LDA procedure:

lda-2-class.png
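A minimal sketch of the 2-class (Fisher) LDA direction, which maximizes between-class separation relative to within-class scatter; the function name and variable names are illustrative:

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Fisher discriminant direction for two classes, given (N0, D) and (N1, D) data."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - mu0).T @ (X0 - mu0)            # within-class scatter, class 0
    S1 = (X1 - mu1).T @ (X1 - mu1)            # within-class scatter, class 1
    Sw = S0 + S1                              # total within-class scatter
    w = np.linalg.solve(Sw, mu1 - mu0)        # w is proportional to Sw^{-1} (mu1 - mu0)
    return w / np.linalg.norm(w)              # unit-norm projection direction
```

Projecting the data onto this direction gives the 1-D representation that best separates the two classes in the above sense.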


Inference in jointly Gaussian distributions

Given a joint distribution $p\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)$, it is useful to be able to compute marginals $p\left(\mathbf{x}_{1}\right)$ and conditionals $p\left(\mathbf{x}_{1} | \mathbf{x}_{2}\right)$.

Suppose the joint is Gaussian, $p\left(\mathbf{x}_{1}, \mathbf{x}_{2}\right)=\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$, with parameters partitioned as
$$
\boldsymbol{\mu}=\left(\begin{array}{c}{\boldsymbol{\mu}_{1}} \\ {\boldsymbol{\mu}_{2}}\end{array}\right), \quad \boldsymbol{\Sigma}=\left(\begin{array}{cc}{\boldsymbol{\Sigma}_{11}} & {\boldsymbol{\Sigma}_{12}} \\ {\boldsymbol{\Sigma}_{21}} & {\boldsymbol{\Sigma}_{22}}\end{array}\right)
$$

Then the marginals are given by:

$$
\begin{aligned}
p\left(\mathbf{x}_{1}\right) &=\mathcal{N}\left(\mathbf{x}_{1} | \boldsymbol{\mu}_{1}, \boldsymbol{\Sigma}_{11}\right) \\
p\left(\mathbf{x}_{2}\right) &=\mathcal{N}\left(\mathbf{x}_{2} | \boldsymbol{\mu}_{2}, \boldsymbol{\Sigma}_{22}\right)
\end{aligned}
$$

i.e., we simply read off the mean and covariance block of each random variable separately.

and the posterior conditional is given by:
$$
p\left(\mathbf{x}_{1} | \mathbf{x}_{2}\right)=\mathcal{N}\left(\mathbf{x}_{1} | \boldsymbol{\mu}_{1|2}, \boldsymbol{\Sigma}_{1|2}\right)
$$
where $\boldsymbol{\Lambda}=\boldsymbol{\Sigma}^{-1}$ is the precision matrix, partitioned the same way as $\boldsymbol{\Sigma}$, and
$$
\begin{aligned}
\boldsymbol{\mu}_{1|2} &=\boldsymbol{\mu}_{1}+\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}\left(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}\right) \\
&=\boldsymbol{\mu}_{1}-\boldsymbol{\Lambda}_{11}^{-1} \boldsymbol{\Lambda}_{12}\left(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}\right) \\
&=\boldsymbol{\Sigma}_{1|2}\left(\boldsymbol{\Lambda}_{11} \boldsymbol{\mu}_{1}-\boldsymbol{\Lambda}_{12}\left(\mathbf{x}_{2}-\boldsymbol{\mu}_{2}\right)\right)
\end{aligned}
$$
$$
\boldsymbol{\Sigma}_{1|2}=\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}=\boldsymbol{\Lambda}_{11}^{-1}
$$

Observe that the conditional mean is just a linear function of $\mathbf{x}_{2}$, while the conditional covariance is a constant matrix that is independent of $\mathbf{x}_{2}$.

Note that the marginals and conditionals are also Gaussians.
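The marginalization and conditioning formulas above translate directly into NumPy; this is a minimal sketch with assumed helper and variable names:

```python
import numpy as np

def gaussian_condition(mu, Sigma, idx1, idx2, x2):
    """Compute p(x1 | x2) for a jointly Gaussian vector.

    mu, Sigma: joint mean and covariance; idx1, idx2: index arrays for the
    x1 and x2 blocks; x2: observed value of the x2 block.
    The marginal of x1 is simply (mu[idx1], Sigma[np.ix_(idx1, idx1)]).
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    K = S12 @ np.linalg.inv(S22)              # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + K @ (x2 - mu2)            # conditional mean (linear in x2)
    Sigma_cond = S11 - K @ S21                # conditional covariance (independent of x2)
    return mu_cond, Sigma_cond
```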


Linear Gaussian Systems

Suppose we have two variables, $\mathbf{x}$ and $\mathbf{y}$.
Let $\mathbf{x} \in \mathbb{R}^{D_{x}}$ be a hidden (unobservable) variable, and $\mathbf{y} \in \mathbb{R}^{D_{y}}$ be a noisy observation of $\mathbf{x}$.

Let us assume we have the following prior and likelihood:

$$
\begin{aligned}
p(\mathbf{x}) &=\mathcal{N}\left(\mathbf{x} | \boldsymbol{\mu}_{x}, \boldsymbol{\Sigma}_{x}\right) \\
p(\mathbf{y} | \mathbf{x}) &=\mathcal{N}\left(\mathbf{y} | \mathbf{A} \mathbf{x}+\mathbf{b}, \boldsymbol{\Sigma}_{y}\right)
\end{aligned}
$$

where $\mathbf{A}$ is a matrix of size $D_{y} \times D_{x}$.
The task we are interested in is inferring $\mathbf{x}$ given $\mathbf{y}$.

One can show that the posterior $p(\mathbf{x} | \mathbf{y})$ is given by:
$$
\begin{aligned}
p(\mathbf{x} | \mathbf{y}) &=\mathcal{N}\left(\mathbf{x} | \boldsymbol{\mu}_{x|y}, \boldsymbol{\Sigma}_{x|y}\right) \\
\boldsymbol{\Sigma}_{x|y}^{-1} &=\boldsymbol{\Sigma}_{x}^{-1}+\mathbf{A}^{T} \boldsymbol{\Sigma}_{y}^{-1} \mathbf{A} \\
\boldsymbol{\mu}_{x|y} &=\boldsymbol{\Sigma}_{x|y}\left[\mathbf{A}^{T} \boldsymbol{\Sigma}_{y}^{-1}(\mathbf{y}-\mathbf{b})+\boldsymbol{\Sigma}_{x}^{-1} \boldsymbol{\mu}_{x}\right]
\end{aligned}
$$

The above result is useful when one needs to infer an unknown scalar/vector from noisy measurements.
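A sketch of this posterior update (function and variable names are my own); the example infers a scalar from two noisy measurements of it:

```python
import numpy as np

def linear_gaussian_posterior(y, A, b, mu_x, Sigma_x, Sigma_y):
    """Posterior p(x | y) for the linear Gaussian system above."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sy_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(Sx_inv + A.T @ Sy_inv @ A)            # Sigma_{x|y}
    mu_post = Sigma_post @ (A.T @ Sy_inv @ (y - b) + Sx_inv @ mu_x)  # mu_{x|y}
    return mu_post, Sigma_post

# Example: infer a scalar x from two noisy observations y of it
A = np.array([[1.0], [1.0]])
y = np.array([2.9, 3.1])
mu_post, Sigma_post = linear_gaussian_posterior(
    y, A, b=np.zeros(2),
    mu_x=np.array([0.0]), Sigma_x=np.array([[10.0]]), Sigma_y=np.eye(2))
```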