We begin by giving some intuition about mixture models. Consider an unobserved (or latent) discrete random variable taking $k$ states $s$ with probabilities $\text{Pr}(s=i)=\pi_i$ for $1\leq i \leq k$, and $k$ random variables $x_i|s_i$ with arbritary densities or distributions, which are *conditionally independent* of each other given the state of $s$. In the finite mixture model, we model the probability or density for a single point $x$ begin generated by the weighted mixture of the $x_i|s_i$

which is simply the marginalisation over the latent variable $s$. Note that $\sum_{i=1}^k\pi_i=1$.

For example, for the Gaussian mixture model (GMM), we get (adding a collection of parameters $\theta:=\{\boldsymbol{\mu}_i, \Sigma_i\}_{i=1}^k$ that contains $k$ mean and covariance parameters of single Gaussian distributions)

$$ p(x|\theta)=\sum_{i=1}^k \pi_i \mathcal{N}(\boldsymbol{\mu}_i,\Sigma_i) $$Note that *any* set of probability distributions on the same domain can be combined to such a mixture model. Note again that $s$ is an unobserved discrete random variable, i.e. we model data being generated from some weighted combination of baseline distributions. Interesting problems now are

- Learning the weights $\text{Pr}(s=i)=\pi_i$ from data
- Learning the parameters $\theta$ from data for a fixed family of $x_i|s_i$, for example for the GMM
- Using the learned model (which is a density estimate) for clustering or classification

All of these problems are in the context of *unsupervised learning* since the algorithm only sees the plain data and no information on its structure.

Expectation Maximisation (EM) is a powerful method to learn any form of latent models and can be applied to the Gaussian mixture model case. Standard methods such as Maximum Likelihood are not straightforward for latent models in general, while EM can almost always be applied. However, it might converge to local optima and does not guarantee globally optimal solutions (this can be dealt with with some tricks as we will see later). While the general idea in EM stays the same for all models it can be used on, the individual steps depend on the particular model that is being used.

The basic idea in EM is to maximise a lower bound, typically called the *free energy*, on the log-likelihood of the model. It does so by repeatedly performing two steps

- The E-step optimises the free energy with respect to the latent variables $s_i$, holding the parameters $\theta$ fixed. This is done via setting the distribution over $s$ to the posterior given the used observations.
- The M-step optimises the free energy with respect to the paramters $\theta$, holding the distribution over the $s_i$ fixed. This is done via maximum likelihood.

It can be shown that this procedure never decreases the likelihood and that stationary points (i.e. neither E-step nor M-step produce changes) of it corresponds to local maxima in the model's likelihood. See references for more details on the procedure, and how to obtain a lower bound on the log-likelihood. There exist many different flavours of EM, including variants where only subsets of the model are iterated over at a time. There is no learning rate such as step size or similar, which is good and bad since convergence can be slow.

The main class for GMM in Shogun is CGMM, which contains an interface for setting up a model and sampling from it, but also to learn the model (the $\pi_i$ and parameters $\theta$) via EM. It inherits from the base class for distributions in Shogun, CDistribution, and combines multiple single distribution instances to a mixture.

We start by creating a GMM instance, sampling from it, and computing the log-likelihood of the model for some points, and the log-likelihood of each individual component for some points. All these things are done in two dimensions to be able to plot them, but they generalise to higher (or lower) dimensions easily.

Let's sample, and illustrate the difference of knowing the latent variable indicating the component or not.

In [1]:

```
%pylab inline
%matplotlib inline
# import all Shogun classes
from modshogun import *
```

In [2]:

```
from matplotlib.patches import Ellipse
# a tool for visualisation
def get_gaussian_ellipse_artist(mean, cov, nstd=1.96, color="red", linewidth=3):
"""
Returns an ellipse artist for nstd times the standard deviation of this
Gaussian, specified by mean and covariance
"""
# compute eigenvalues (ordered)
vals, vecs = eigh(cov)
order = vals.argsort()[::-1]
vals, vecs = vals[order], vecs[:, order]
theta = numpy.degrees(arctan2(*vecs[:, 0][::-1]))
# width and height are "full" widths, not radius
width, height = 2 * nstd * sqrt(vals)
e = Ellipse(xy=mean, width=width, height=height, angle=theta, \
edgecolor=color, fill=False, linewidth=linewidth)
return e
```

Set up the model in Shogun

In [3]:

```
# create mixture of three Gaussians
num_components=3
num_max_samples=100
gmm=GMM(num_components)
dimension=2
# set means (TODO interface should be to construct mixture from individuals with set parameters)
means=zeros((num_components, dimension))
means[0]=[-5.0, -4.0]
means[1]=[7.0, 3.0]
means[2]=[0, 0.]
[gmm.set_nth_mean(means[i], i) for i in range(num_components)]
# set covariances
covs=zeros((num_components, dimension, dimension))
covs[0]=array([[2, 1.3],[.6, 3]])
covs[1]=array([[1.3, -0.8],[-0.8, 1.3]])
covs[2]=array([[2.5, .8],[0.8, 2.5]])
[gmm.set_nth_cov(covs[i],i) for i in range(num_components)]
# set mixture coefficients, these have to sum to one (TODO these should be initialised automatically)
weights=array([0.5, 0.3, 0.2])
gmm.set_coef(weights)
```

In [4]:

```
# now sample from each component seperately first, the from the joint model
hold(True)
colors=["red", "green", "blue"]
for i in range(num_components):
# draw a number of samples from current component and plot
num_samples=int(rand()*num_max_samples)+1
# emulate sampling from one component (TODO fix interface of GMM to handle this)
w=zeros(num_components)
w[i]=1.
gmm.set_coef(w)
# sample and plot (TODO fix interface to have loop within)
X=array([gmm.sample() for _ in range(num_samples)])
plot(X[:,0], X[:,1], "o", color=colors[i])
# draw 95% elipsoid for current component
gca().add_artist(get_gaussian_ellipse_artist(means[i], covs[i], color=colors[i]))
hold(False)
_=title("%dD Gaussian Mixture Model with %d components" % (dimension, num_components))
# since we used a hack to sample from each component
gmm.set_coef(weights)
```

In [5]:

```
# generate a grid over the full space and evaluate components PDF
resolution=100
Xs=linspace(-10,10, resolution)
Ys=linspace(-8,6, resolution)
pairs=asarray([(x,y) for x in Xs for y in Ys])
D=asarray([gmm.cluster(pairs[i])[3] for i in range(len(pairs))]).reshape(resolution,resolution)
figure(figsize=(18,5))
subplot(1,2,1)
pcolor(Xs,Ys,D)
xlim([-10,10])
ylim([-8,6])
title("Log-Likelihood of GMM")
subplot(1,2,2)
pcolor(Xs,Ys,exp(D))
xlim([-10,10])
ylim([-8,6])
_=title("Likelihood of GMM")
```

In [6]:

```
# sample and plot (TODO fix interface to have loop within)
X=array([gmm.sample() for _ in range(num_max_samples)])
plot(X[:,0], X[:,1], "o")
_=title("Samples from GMM")
```

In [7]:

```
def estimate_gmm(X, num_components):
# bring data into shogun representation (note that Shogun data is in column vector form, so transpose)
features=RealFeatures(X.T)
gmm_est=GMM(num_components)
gmm_est.set_features(features)
# learn GMM
gmm_est.train_em()
return gmm_est
```

So far so good, now lets plot the density of this GMM using the code from above

In [8]:

```
component_numbers=[2,3]
# plot true likelihood
D_true=asarray([gmm.cluster(pairs[i])[num_components] for i in range(len(pairs))]).reshape(resolution,resolution)
figure(figsize=(18,5))
subplot(1,len(component_numbers)+1,1)
pcolor(Xs,Ys,exp(D_true))
xlim([-10,10])
ylim([-8,6])
title("True likelihood")
for n in range(len(component_numbers)):
# TODO get rid of these hacks and offer nice interface from Shogun
# learn GMM with EM
gmm_est=estimate_gmm(X, component_numbers[n])
# evaluate at a grid of points
D_est=asarray([gmm_est.cluster(pairs[i])[component_numbers[n]] for i in range(len(pairs))]).reshape(resolution,resolution)
# visualise densities
subplot(1,len(component_numbers)+1,n+2)
pcolor(Xs,Ys,exp(D_est))
xlim([-10,10])
ylim([-8,6])
_=title("Estimated likelihood for EM with %d components"%component_numbers[n])
```

It seems that three comonents give a density that is closest to the original one. While two components also do a reasonable job here, it might sometimes happen (KMeans is used to initialise the cluster centres if not done by hand, using a random cluster initialisation) that the upper two Gaussians are grouped, re-run for a couple of times to see this. This illustrates how EM might get stuck in a local minimum. We will do this below, where it might well happen that all runs produce the same or different results - no guarantees. Note that it is easily possible to initialise EM via specifying the parameters of the mixture components as did to create the original model above.

One way to decide which of multiple convergenced EM instances to use is to simply compute many of them (with different initialisations) and then choose the one with the largest likelihood. *WARNING* Do not select the number of components like this as the model will overfit.

In [9]:

```
# function to draw ellipses for all components of a GMM
def visualise_gmm(gmm, color="blue"):
for i in range(gmm.get_num_components()):
component=Gaussian.obtain_from_generic(gmm.get_component(i))
gca().add_artist(get_gaussian_ellipse_artist(component.get_mean(), component.get_cov(), color=color))
```

In [10]:

```
# multiple runs to illustrate random initialisation matters
for _ in range(3):
figure(figsize=(18,5))
subplot(1, len(component_numbers)+1, 1)
plot(X[:,0],X[:,1], 'o')
visualise_gmm(gmm_est, color="blue")
title("True components")
for i in range(len(component_numbers)):
gmm_est=estimate_gmm(X, component_numbers[i])
subplot(1, len(component_numbers)+1, i+2)
plot(X[:,0],X[:,1], 'o')
visualise_gmm(gmm_est, color=colors[i])
# TODO add a method to get likelihood of full model, retraining is inefficient
likelihood=gmm_est.train_em()
_=title("Estimated likelihood: %.2f (%d components)"%(likelihood,component_numbers[i]))
```

In [11]:

```
def cluster_and_visualise(gmm_est):
# obtain cluster index for each point of the training data
# TODO another hack here: Shogun should allow to pass multiple points and only return the index
# as the likelihood can be done via the individual components
# In addition, argmax should be computed for us, although log-pdf for all components should also be possible
clusters=asarray([argmax(gmm_est.cluster(x)[:gmm.get_num_components()]) for x in X])
# visualise points by cluster
hold(True)
for i in range(gmm.get_num_components()):
indices=clusters==i
plot(X[indices,0],X[indices,1], 'o', color=colors[i])
hold(False)
# learn gmm again
gmm_est=estimate_gmm(X, num_components)
figure(figsize=(18,5))
subplot(121)
cluster_and_visualise(gmm)
title("Clustering under true GMM")
subplot(122)
cluster_and_visualise(gmm_est)
_=title("Clustering under estimated GMM")
```

These are clusterings obtained via the true mixture model and the one learned via EM. There is a slight subtlety here: even the model under which the data was generated will not cluster the data correctly if the data is overlapping. This is due to the fact that the cluster with the largest probability is chosen. This doesn't allow for any ambiguity. If you are interested in cases where data overlaps, you should always look at the log-likelihood of the point for each cluster and consider taking into acount "draws" in the decision, i.e. probabilities for two different clusters are equally large.

Below we plot all points, coloured by their likelihood under each component.

In [12]:

```
figure(figsize=(18,5))
for comp_idx in range(num_components):
subplot(1,num_components,comp_idx+1)
# evaluated likelihood under current component
# TODO Shogun should do the loop and allow to specify component indices to evaluate pdf for
# TODO distribution interface should be the same everywhere
component=Gaussian.obtain_from_generic(gmm.get_component(comp_idx))
cluster_likelihoods=asarray([component.compute_PDF(X[i]) for i in range(len(X))])
# normalise
cluster_likelihoods-=cluster_likelihoods.min()
cluster_likelihoods/=cluster_likelihoods.max()
# plot, coloured by likelihood value
cm=get_cmap("jet")
hold(True)
for j in range(len(X)):
color = cm(cluster_likelihoods[j])
plot(X[j,0], X[j,1] ,"o", color=color)
hold(False)
title("Data coloured by likelihood for component %d" % comp_idx)
```

Note how the lower left and middle cluster are overlapping in the sense that points at their intersection have similar likelihoods. If you do not care at all about this and are just interested in a partitioning of the space, simply choose the maximum.

Below we plot the space partitioning for a hard clustering.

In [13]:

```
# compute cluster index for every point in space
D_est=asarray([gmm_est.cluster(pairs[i])[:num_components].argmax() for i in range(len(pairs))]).reshape(resolution,resolution)
# visualise clustering
cluster_and_visualise(gmm_est)
# visualise space partitioning
hold(True)
pcolor(Xs,Ys,D_est)
hold(False)
```

- Split and merge Expectation Maximisation

TODO (bugfixes, interface issues, see also all the TODOs in notebook)

- Check cholesky computation
- likelhoods should sum to one
- sample multiple points
- rename cluster method to compute_log_likelihoods
- pull model likelihood out of sample
proper mixture model interface with classical distribution and EM interface (E,M methods)

set_coef dimensiosn check. DImension check in general

- check for initialization before sampling
- eigen3ize Gaussian and remove GaussianDistribution

[1]: Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.

[2]: Bishop, C. M., & others. (2006). Pattern recognition and machine learning. Springer New York.

[3]: Gatsby Computational Neurocience Unit lecture notes for Machine Learning (link)