Mixture Models
A mixture model is a statistical model for representing a general multimodal PDF. It assumes that the data is drawn from multiple subgroups (the mixture components); that is, we model the distribution as a convex combination of standard distributions (Gaussians, for example).
\[\begin{align*} p(x) &= \sum_{k=1}^K w_kp_k(x) \\ \sum_{k=1}^K w_k &= 1 \end{align*}\]The weights are positive and sum to one, which ensures that the mixture density integrates to one as well.
- Finite Mixture Model - $K$ is finite
- Gaussian Mixture Model (GMM) - each component PDF is a Gaussian distribution. A GMM can model unimodal, multimodal, and curved distributions
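To make the definition concrete, here is a minimal Python sketch that evaluates a two-component 1-D Gaussian mixture density; the weights, means, and standard deviations are illustrative values, not taken from the notes.

```python
# Minimal sketch: p(x) = sum_k w_k * N(x; mu_k, sigma_k^2) for K = 2 components.
# All numerical values below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

weights = np.array([0.3, 0.7])   # w_k: positive, sums to one
means   = np.array([-2.0, 1.5])  # mu_k
stds    = np.array([0.8, 1.2])   # component standard deviations

def mixture_pdf(x):
    # Convex combination of the component Gaussian densities
    return sum(w * norm.pdf(x, loc=m, scale=s)
               for w, m, s in zip(weights, means, stds))

print(mixture_pdf(np.linspace(-6, 6, 5)))  # pointwise mixture density values
```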
Fitting a GMM to Data - Maximum Likelihood
Given the data $y=\{y_n\}_{n=1}^N$, we try to fit a GMM with $K$ components to it.
\[P(x) := \sum_{k=1}^K w_k G(x;\mu_k, C_k)\]The parameters involved here are $\theta = \{w_k, \mu_k, C_k\}_{k=1}^K$. The log-likelihood of the data under this model is given by the following equation. Notice that we cannot get a closed-form solution because of the summation inside the logarithm. Gradient descent converges quite slowly, so we introduce a new strategy for dealing with this.
\[\begin{align*} LL(\theta\vert y) &= \sum_{n=1}^N\log\left( \sum_{k=1}^K w_k G(y_n;\mu_k, C_k) \right) \end{align*}\]
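As a sanity check on the objective, here is a hedged sketch (assuming 1-D data and the illustrative parameters from the earlier snippet) that evaluates $LL(\theta\vert y)$ using a numerically stable log-sum-exp over the components:

```python
# Sketch (assumes 1-D data): GMM log-likelihood, summed over the data points.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def gmm_log_likelihood(y, weights, means, stds):
    # log p(y_n) = logsumexp_k [ log w_k + log G(y_n; mu_k, C_k) ]
    log_terms = (np.log(weights)[None, :]
                 + norm.logpdf(y[:, None], loc=means[None, :], scale=stds[None, :]))
    return logsumexp(log_terms, axis=1).sum()

y = np.array([-2.1, -1.8, 0.9, 1.4, 2.0])  # toy data, illustrative only
print(gmm_log_likelihood(y, np.array([0.3, 0.7]),
                         np.array([-2.0, 1.5]), np.array([0.8, 1.2])))
```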
Expectation Maximization (EM) Optimization
This is an iterative algorithm in which a closed-form solution is obtained in each iteration. It introduces a “hidden” random variable $X_n$ which we assume was not observed when collecting the data. The complete data would therefore be $\{(x_n, y_n)\}$.
We then marginalize over this introduced hidden variable.
\[\max_{\theta}P(y\vert\theta) = \max_\theta\int_xP(y,x\vert\theta)dx\]Let the parameters at iteration $i$ be $\theta_i$.
E Step
Design a new function of parameters $F(\theta ; \theta_i)$ such that
- $F$ is a lower bound for the Log-Likelihood function
- $F$ touches the log-likelihood function at the current estimate $\theta_i$
M Step
Choose the new parameter estimates $\theta_{i+1}$ to be where the previously defined function $F$ has a maximum (or at least a point where $F$ is larger than its value at the previous iteration).
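Putting the two steps together, here is a minimal sketch of the E/M loop for a 1-D GMM; the function name, initialization, and termination threshold are assumptions chosen for illustration, not prescribed by the notes.

```python
# Minimal EM sketch for a 1-D GMM: the E step computes responsibilities
# r[n, k] = P(z_n = k | y_n, theta_i); the M step re-estimates w_k, mu_k
# and the component variances in closed form.
import numpy as np
from scipy.stats import norm

def em_gmm_1d(y, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)                    # uniform initial weights
    mu = rng.choice(y, size=K, replace=False)  # initialize means at data points
    var = np.full(K, y.var())                  # shared initial variance
    prev_mu = mu.copy()
    for _ in range(n_iter):
        # E step: posterior responsibility of each component for each point
        dens = w[None, :] * norm.pdf(y[:, None], mu[None, :], np.sqrt(var)[None, :])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form updates that maximize the lower bound
        Nk = r.sum(axis=0)
        w = Nk / len(y)
        mu = (r * y[:, None]).sum(axis=0) / Nk
        var = (r * (y[:, None] - mu[None, :]) ** 2).sum(axis=0) / Nk
        # stop once the change in the means falls below the threshold
        if np.max(np.abs(mu - prev_mu)) < tol:
            break
        prev_mu = mu.copy()
    return w, mu, var

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 0.8, 200), rng.normal(1.5, 1.2, 300)])
print(em_gmm_1d(y, K=2))
```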
The lower bound $F$ is constructed using an entropy term and a KL divergence, as sketched below.
The algorithm is terminated when the change in $\theta$ falls below a user-defined threshold. (Computing the entropy part is computationally intensive)
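Concretely, the standard decomposition behind the E step can be written as follows; the distribution $q$ over the hidden variable is notation introduced here, since the notes do not fix a symbol. Setting $q(x) = P(x\vert y,\theta_i)$ makes the KL term zero, so $F$ touches the log-likelihood at $\theta_i$ and lower-bounds it everywhere else, because the KL divergence is non-negative.
\[\begin{align*} LL(\theta\vert y) &= \underbrace{\mathbb{E}_{q(x)}\left[\log P(y, x\vert\theta)\right] + H(q)}_{F(q,\,\theta)} + KL\left(q(x)\,\Vert\,P(x\vert y,\theta)\right) \end{align*}\]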
GMM Using EM
Let the hidden data be modeled by the random variable $Z=\{z_n\}$, where $z_n \in \{1,\dots,K\}$ indicates which mixture component generated $y_n$.
At iteration $i$, the E step computes the responsibilities $\gamma_{nk} := P(z_n = k \vert y_n, \theta_i)$, and the lower bound (up to the entropy term, which does not depend on $\theta$) is the expected complete-data log-likelihood \[\begin{align*} Q(\theta;\theta_i) &= \mathbb{E}_{Z\vert y,\theta_i}\left[\log P(y, Z\vert\theta)\right] = \sum_{n=1}^N\sum_{k=1}^K \gamma_{nk}\left( \log w_k + \log G(y_n;\mu_k, C_k) \right) \end{align*}\]
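Maximizing $Q(\theta;\theta_i)$ in closed form gives the standard GMM update equations for the M step (stated here for completeness rather than derived):
\[\begin{align*} \gamma_{nk} &= \frac{w_k G(y_n;\mu_k, C_k)}{\sum_{j=1}^K w_j G(y_n;\mu_j, C_j)}, \qquad N_k = \sum_{n=1}^N \gamma_{nk} \\ w_k &\leftarrow \frac{N_k}{N}, \qquad \mu_k \leftarrow \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\, y_n, \qquad C_k \leftarrow \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\,(y_n-\mu_k)(y_n-\mu_k)^T \end{align*}\]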