Kernel hypothesis testing in Shogun

This notebook describes Shogun's framework for statistical hypothesis testing. We begin by giving a brief outline of the problem setting and then describe various implemented algorithms. All algorithms discussed here are instances of kernel two-sample testing with the maximum mean discrepancy, and are based on embedding probability distributions into Reproducing Kernel Hilbert Spaces (RKHS).

There are two types of tests available, a quadratic time test and a linear time test. Both come in various flavours.

In [1]:
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import modshogun as sg
import numpy as np
Populating the interactive namespace from numpy and matplotlib

Some Formal Basics (skip if you just want code examples)

To set the context, we here briefly describe statistical hypothesis testing. Informally, one defines a hypothesis on a certain domain and then uses a statistical test to check whether this hypothesis is true. Formally, the goal is to reject a so-called null-hypothesis $H_0:p=q$, which is the complement of an alternative-hypothesis $H_A$.

To distinguish the hypotheses, a test statistic is computed on sample data. Since sample data is finite, this corresponds to sampling the true distribution of the test statistic. There are two different distributions of the test statistic -- one for each hypothesis. The null-distribution corresponds to test statistic samples under the model that $H_0$ holds; the alternative-distribution corresponds to test statistic samples under the model that $H_A$ holds.

In practice, one tries to compute the quantile of the test statistic in the null-distribution. If the test statistic lies in a high quantile, i.e. it is unlikely that the null-distribution has generated it, the null-hypothesis $H_0$ is rejected.

There are two different kinds of errors in hypothesis testing:

  • A type I error is made when $H_0: p=q$ is wrongly rejected. That is, the test says that the samples are from different distributions when they are not.
  • A type II error is made when $H_0: p=q$ is wrongly accepted. That is, the test says that the samples are from the same distribution when they are not.

A so-called consistent test achieves zero type II error for a fixed type I error in the limit of infinite data.

To decide whether to reject $H_0$, one could set a threshold, say at the $95\%$ quantile of the null-distribution, and reject $H_0$ when the test statistic lies above that threshold. This means that, if $H_0$ holds, the chance of wrongly rejecting it is $5\%$. We call this number the test level $\alpha$ (in this case $\alpha=0.05$; the code cells in this notebook label it "test power"). It is an upper bound on the probability for a type I error. An alternative way is simply to compute the p-value, i.e. the probability under the null-distribution of observing a value at least as large as the test statistic, and to compare it against a desired test level, say $\alpha=0.05$, by hand. The advantage of the second method is that one not only gets a binary answer, but also an upper bound on the type I error.
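Both decision rules can be sketched with plain NumPy once samples from a null distribution are available. This is only an illustration, not Shogun's implementation: the chi-squared stand-in for the null samples, the observed value `statistic=5.0` and the two helper names are assumptions made up for this example.

```python
import numpy as np

def p_value_from_null(null_samples, statistic):
    # fraction of null samples that are at least as large as the observed value
    return np.mean(np.asarray(null_samples) >= statistic)

def threshold_from_null(null_samples, alpha):
    # (1 - alpha) quantile of the null samples
    return np.quantile(null_samples, 1.0 - alpha)

rng = np.random.default_rng(0)
null_samples = rng.chisquare(df=1, size=10000)  # stand-in null distribution
statistic = 5.0                                 # hypothetical observed statistic
alpha = 0.05

p_value = p_value_from_null(null_samples, statistic)
threshold = threshold_from_null(null_samples, alpha)

# the two rules lead to the same decision (up to ties at the threshold)
reject_by_threshold = statistic > threshold
reject_by_p_value = p_value < alpha
print("p-value %.3f, threshold %.2f" % (p_value, threshold))
```

The p-value carries more information than the binary decision, since it can be compared against any test level afterwards.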

In order to construct a two-sample test, the null-distribution of the test statistic has to be approximated. One way of doing this is called the permutation test, where samples from both sources are mixed and permuted repeatedly and the test statistic is computed for each of those configurations. While this method works for every statistical hypothesis test, it might be very costly because the test statistic has to be re-computed many times. Shogun comes with a highly optimized implementation though. For completeness, Shogun also includes a number of more sophisticated ways of approximating the null distribution.
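A minimal sketch of the permutation idea, independent of Shogun, might look as follows. The statistic used here (absolute difference of sample means) and all function names are assumptions chosen to keep the example short; any two-sample statistic could be plugged in instead.

```python
import numpy as np

def permutation_null(x, y, statistic, num_permutations=500, seed=0):
    # mix both samples, reshuffle them repeatedly and recompute the statistic
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    m = len(x)
    null_samples = np.empty(num_permutations)
    for i in range(num_permutations):
        permuted = rng.permutation(z)
        null_samples[i] = statistic(permuted[:m], permuted[m:])
    return null_samples

def abs_mean_difference(x, y):
    return abs(np.mean(x) - np.mean(y))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200)
y = rng.normal(1.0, 1.0, size=200)  # shifted mean, so H_0 is false here

observed = abs_mean_difference(x, y)
null_samples = permutation_null(x, y, abs_mean_difference)
p_value = np.mean(null_samples >= observed)
print("p-value of permutation test: %.3f" % p_value)
```

The cost is one evaluation of the statistic per permutation, which is why an efficient implementation of the statistic matters.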

Base class for Hypothesis Testing

Shogun implements statistical testing in the abstract class CHypothesisTest. All implemented methods will work with this interface at their most basic level. We here focus on CTwoSampleTest. This class offers methods to

  • compute the implemented test statistic,
  • compute p-values for a given value of the test statistic,
  • compute a test threshold for a given p-value,
  • approximate the null distribution, e.g. perform the permutation test and
  • perform a full two-sample test, and either return a p-value or a binary rejection decision. This method is most useful in practice. Note that, depending on the used test statistic.

Kernel Two-Sample Testing with the Maximum Mean Discrepancy

$\DeclareMathOperator{\mmd}{MMD}$ An important class of hypothesis tests are the two-sample tests. In two-sample testing, one tries to find out whether two sets of samples come from different distributions. Given two probability distributions $p,q$ on some arbitrary domains $\mathcal{X}, \mathcal{Y}$ respectively, and i.i.d. samples $X=\{x_i\}_{i=1}^m\subseteq \mathcal{X}\sim p$ and $Y=\{y_i\}_{i=1}^n\subseteq \mathcal{Y}\sim q$, the two-sample test distinguishes the hypotheses

\begin{align*} H_0: p=q\\ H_A: p\neq q \end{align*}

In order to solve this problem, it is desirable to have a criterion that takes a positive value if $p\neq q$, and zero if and only if $p=q$. The so-called Maximum Mean Discrepancy (MMD) has this property and makes it possible to distinguish any two probability distributions, if used in a reproducing kernel Hilbert space (RKHS). It is the distance of the mean embeddings $\mu_p, \mu_q$ of the distributions $p,q$ in such an RKHS $\mathcal{F}$, which can also be expressed in terms of expectations of kernel functions, i.e.

\begin{align*} \mmd[\mathcal{F},p,q]^2&=||\mu_p-\mu_q||_\mathcal{F}^2\\ &=\textbf{E}_{x,x'}\left[ k(x,x')\right]- 2\textbf{E}_{x,y}\left[ k(x,y)\right] +\textbf{E}_{y,y'}\left[ k(y,y')\right] \end{align*}

Note that this formulation does not assume any form of the input data. We just need a kernel function whose feature space is an RKHS, see [2, Section 2] for details. This has the consequence that in Shogun, we can do tests on any type of data (CDenseFeatures, CSparseFeatures, CStringFeatures, etc), as long as we or you provide a positive definite kernel function under the interface of CKernel.

We here only describe how to use the MMD for two-sample testing. Shogun offers two types of test statistic based on the MMD, one with quadratic costs both in time and space, and one with linear time and constant space costs. Both come in different versions and with different methods for approximating the null-distribution in order to construct a two-sample test.

Running Example Data. Gaussian vs. Laplace

In order to illustrate kernel two-sample testing with Shogun, we use a couple of toy distributions. The first dataset we consider is the 1D Gaussian

$$ p(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

with mean $\mu$ and variance $\sigma^2$, which is compared against the 1D Laplace distribution

$$ q(x)=\frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right) $$

with the same mean $\mu$ and variance $2b^2$. In order to increase difficulty, we set $b=\sqrt{\frac{1}{2}}$, which means that $2b^2=\sigma^2=1$.

In [2]:
# use scipy for generating samples
from scipy.stats import laplace, norm

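def sample_gaussian_vs_laplace(n=220, mu=0.0, sigma2=1, b=np.sqrt(0.5)):    
    # sample from both distributions
    # (scipy's scale parameter is the standard deviation, hence the square root)
    X=norm.rvs(size=n, loc=mu, scale=np.sqrt(sigma2))
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    return X,Y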
In [3]:
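# parameters of the toy problem (same values as the defaults above)
mu=0.0
sigma2=1
b=np.sqrt(0.5)
n=220
X,Y=sample_gaussian_vs_laplace(n, mu, sigma2, b)

# plot both densities and histograms
plt.suptitle("Gaussian vs. Laplace")
Xs=np.linspace(-2, 2, 500)
# scipy's scale parameter is the standard deviation, not the variance
plt.plot(Xs, norm.pdf(Xs, loc=mu, scale=np.sqrt(sigma2)))
plt.plot(Xs, laplace.pdf(Xs, loc=mu, scale=b))

plt.hist(X, alpha=0.5)
plt.hist(Y, alpha=0.5)
plt.legend(["Gaussian", "Laplace"])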

Now how to compare these two sets of samples? Clearly, a t-test would be a bad idea since it basically compares mean and variance of $X$ and $Y$, and we set those to be equal. By chance, the estimates of these statistics might differ, but that is unlikely to be significant. Thus, we have to look at higher order statistics of the samples. In fact, kernel two-sample tests look at all (infinitely many) higher order moments.

In [4]:
print "Gaussian vs. Laplace"
print "Sample means: %.2f vs %.2f" % (np.mean(X), np.mean(Y))
print "Samples variances: %.2f vs %.2f" % (np.var(X), np.var(Y))
Gaussian vs. Laplace
Sample means: 0.07 vs -0.00
Samples variances: 0.79 vs 0.82
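To make the point about higher order statistics concrete, the following sketch compares the fourth standardised moment of both distributions. It uses NumPy's own samplers rather than the scipy calls above, and the helper `excess_kurtosis` is a name introduced here for illustration. In theory the excess kurtosis is 0 for a Gaussian and 3 for a Laplace distribution, while mean and variance agree.

```python
import numpy as np

def excess_kurtosis(samples):
    # fourth standardised moment minus 3 (which is the value for a Gaussian)
    centred = samples - samples.mean()
    return np.mean(centred**4) / np.mean(centred**2)**2 - 3.0

rng = np.random.default_rng(0)
n = 200000
x = rng.normal(0.0, 1.0, size=n)
y = rng.laplace(0.0, np.sqrt(0.5), size=n)  # b=sqrt(1/2) gives variance 2b^2=1

print("Sample variances: %.2f vs %.2f" % (x.var(), y.var()))
print("Excess kurtosis:  %.2f vs %.2f" % (excess_kurtosis(x), excess_kurtosis(y)))
```

A test that only looks at the first two moments cannot see this difference, which is why the kernel approach is used here.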

Quadratic Time MMD

We now describe the quadratic time MMD from [1, Lemma 6] and use it on the above toy problem. All methods in this section are implemented in CQuadraticTimeMMD, which accepts any type of features in Shogun.

An unbiased estimate for the MMD expression above can be obtained by estimating expected values with averaging over independent samples

$$ \mmd_u[\mathcal{F},X,Y]^2=\frac{1}{m(m-1)}\sum_{i=1}^m\sum_{j\neq i}^mk(x_i,x_j) + \frac{1}{n(n-1)}\sum_{i=1}^n\sum_{j\neq i}^nk(y_i,y_j)-\frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^nk(x_i,y_j) $$

A biased estimate would be

$$ \mmd_b[\mathcal{F},X,Y]^2=\frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^mk(x_i,x_j) + \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nk(y_i,y_j)-\frac{2}{mn}\sum_{i=1}^m\sum_{j=1}^nk(x_i,y_j) .$$

Computing the test statistic using CQuadraticTimeMMD does exactly this, and it is possible to choose between the two expressions above. Note that some methods for approximating the null-distribution only work with one of the two types. The computational costs of both statistics are quadratic in both time and space. Note that the method returns $m\mmd_b[\mathcal{F},X,Y]^2$, i.e. the statistic scaled by $m$, since the null distribution approximations work with this scaled quantity. Here is how the test statistic itself is computed.

In [5]:
# turn data into Shogun representation (column vectors)

# choose kernel for testing. Here: Gaussian
kernel=sg.GaussianKernel(10, kernel_width)

# create mmd instance of test-statistic

# compute biased and unbiased test statistic (default is unbiased)


print "%d x MMD_b[X,Y]^2=%.2f" % (len(X), biased_statistic)
print "%d x MMD_u[X,Y]^2=%.2f" % (len(X), unbiased_statistic)
220 x MMD_b[X,Y]^2=0.92
220 x MMD_u[X,Y]^2=0.41
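For reference, both quadratic time estimates can also be written directly in NumPy. The sketch below is not Shogun's implementation. It assumes the kernel parametrisation $k(x,y)=\exp(-(x-y)^2/\text{width})$, averages the cross term over all pairs $(x_i,y_j)$, and uses function names made up for this example, so its numbers need not match the output above.

```python
import numpy as np

def gaussian_kernel_matrix(a, b, width=1.0):
    # k(x, y) = exp(-(x - y)^2 / width) for 1D samples (assumed parametrisation)
    return np.exp(-(a[:, None] - b[None, :])**2 / width)

def quadratic_time_mmd(x, y, width=1.0):
    m, n = len(x), len(y)
    k_xx = gaussian_kernel_matrix(x, x, width)
    k_yy = gaussian_kernel_matrix(y, y, width)
    k_xy = gaussian_kernel_matrix(x, y, width)
    cross = 2.0 * k_xy.sum() / (m * n)
    # biased: diagonal terms k(x_i, x_i) are included
    biased = k_xx.sum() / m**2 + k_yy.sum() / n**2 - cross
    # unbiased: diagonal terms are removed from the within-sample sums
    unbiased = ((k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
                + (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1)) - cross)
    return biased, unbiased

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=220)
y = rng.laplace(0.0, np.sqrt(0.5), size=220)

biased, unbiased = quadratic_time_mmd(x, y)
print("%d x MMD_b[X,Y]^2=%.2f" % (len(x), len(x) * biased))
print("%d x MMD_u[X,Y]^2=%.2f" % (len(x), len(x) * unbiased))
```

Both kernel matrices have to be held in memory, which is where the quadratic space cost comes from.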

Any sub-class of CHypothesisTest can approximate the null distribution using permutation/bootstrapping. This approach is always guaranteed to produce consistent results. However, it might take a long time since, for each sample of the null distribution, the test statistic has to be computed for a different permutation of the data. Shogun's implementation is highly optimized, exploiting low-level CPU caching and multiple available cores.

In [6]:

# now show a couple of ways to compute the test

# compute p-value for computed test statistic
print "P-value of MMD value %.2f is %.2f" % (statistic, p_value)

# compute threshold for rejecting H_0 for a given test power
print "Threshold for rejecting H0 with a test power of %.2f is %.2f" % (alpha, threshold)

# performing the test by hand given the above results, note that those two are equivalent
if statistic>threshold:
    print "H0 is rejected with confidence %.2f" % alpha
if p_value<alpha:
    print "H0 is rejected with confidence %.2f" % alpha

# or, compute the full two-sample test directly
# fixed test power, binary decision
if binary_test_result:
    print "H0 is rejected with confidence %.2f" % alpha
P-value of MMD value 0.41 is 0.11
Threshold for rejecting H0 with a test power of 0.05 is 0.71

Now let us visualise the distribution of the MMD statistic under $H_0:p=q$ and $H_A:p\neq q$. For that, we sample both the null and the alternative distribution. We use the interface of CHypothesisTest to sample from the null distribution (permutations and re-computing of the test statistic are done internally). For the alternative distribution, we compute the test statistic for a new sample set of $X$ and $Y$ in a loop. Note that the latter is expensive, as the kernel cannot be precomputed and fresh data is needed in every iteration. It is not needed in practice and is done here for illustration only.

In [7]:

# sample null distribution

# sample alternative distribution, generate new data for that
for i in range(num_samples):
    X=norm.rvs(size=n, loc=mu, scale=np.sqrt(sigma2))
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    feat_p=sg.RealFeatures(np.reshape(X, (1,len(X))))
    feat_q=sg.RealFeatures(np.reshape(Y, (1,len(Y))))
    # TODO: reset pre-computed kernel here


Null and Alternative Distribution Illustrated

Visualise both distributions. $H_0:p=q$ is rejected if a sample from the alternative distribution is larger than the $(1-\alpha)$-quantile of the null distribution. See [1] for more details on their forms. From the visualisations, we can read off the test's type I and type II error:

  • type I error is the area of the null distribution to the right of the threshold
  • type II error is the area of the alternative distribution to the left of the threshold
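Given samples from both distributions, these two areas can be estimated by simple counting. In the sketch below the Gaussian stand-ins for `null_samples` and `alt_samples` are assumptions for illustration only. In the notebook they would come from the sampling loop above.

```python
import numpy as np

rng = np.random.default_rng(0)
null_samples = rng.normal(0.0, 1.0, size=20000)  # stand-in for null distribution
alt_samples = rng.normal(2.0, 1.0, size=20000)   # stand-in for alternative distribution
alpha = 0.05

# rejection threshold: (1 - alpha) quantile of the null distribution
threshold = np.quantile(null_samples, 1.0 - alpha)

# type I error: mass of the null distribution right of the threshold
type_1_error = np.mean(null_samples > threshold)
# type II error: mass of the alternative distribution left of the threshold
type_2_error = np.mean(alt_samples <= threshold)

print("threshold %.2f, type I error %.3f, type II error %.3f"
      % (threshold, type_1_error, type_2_error))
```

By construction the type I error equals the test level, while the type II error depends on how far the alternative distribution is shifted away from the null distribution.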
In [8]:
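def plot_alt_vs_null(alt_samples, null_samples, alpha):
    plt.figure(figsize=(18,5))
    # left panel: null distribution only
    plt.subplot(131)
    plt.hist(null_samples, 50, color='blue')
    plt.title('Null distribution')
    # middle panel: alternative distribution only
    plt.subplot(132)
    plt.title('Alternative distribution')
    plt.hist(alt_samples, 50, color='green')
    # right panel: both distributions and the rejection threshold
    plt.subplot(133)
    plt.hist(null_samples, 50, color='blue')
    plt.hist(alt_samples, 50, color='green', alpha=0.5)
    plt.title('Null and alternative distribution')
    # find (1-alpha) element of null distribution
    quantile=np.percentile(null_samples, 100*(1-alpha))
    plt.axvline(x=quantile, ymin=0, ymax=1, color='red', label=str(int(round((1-alpha)*100))) + '% quantile of null')
    plt.legend()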
In [9]:
plot_alt_vs_null(alt_samples, null_samples, alpha)

Different Ways to Approximate the Null Distribution for the Quadratic Time MMD

As already mentioned, permuting the data to access the null distribution is probably the method of choice, due to the efficient implementation in Shogun. There exist a couple of methods that are more sophisticated (and slower) and either allow very fast approximations without guarantees or reasonably fast approximations that are consistent. We present a selection from [2], which are implemented in Shogun.

The first one is a spectral method that is based on the Eigenspectrum of the kernel matrix of the joint samples. It is faster than bootstrapping while being a consistent test. Effectively, the null-distribution of the biased statistic is sampled, but in a more efficient way than the bootstrapping approach. Under $H_0$, the statistic converges in distribution as

$$ m\mmd^2_b \rightarrow \sum_{l=1}^\infty \lambda_l z_l^2 $$

where $z_l\sim \mathcal{N}(0,2)$ are i.i.d. normal samples and $\lambda_l$ are Eigenvalues of expression 2 in [2], which can be empirically estimated by $\hat\lambda_l=\frac{1}{m}\nu_l$ where $\nu_l$ are the Eigenvalues of the centred kernel matrix of the joint samples $X$ and $Y$. The distribution above can be easily sampled. Shogun's implementation has two parameters:

  • Number of samples from null-distribution. The more, the more accurate.
  • Number of Eigenvalues of the Eigen-decomposition of the kernel matrix to use. The more, the better the results get. However, the Eigen-spectrum of the joint gram matrix usually decreases very fast. Plotting the Spectrum can help. See [2] for details.

If the kernel matrices are diagonally dominant, this method is likely to fail. For that and more details, see the original paper. Computational costs are likely to be larger than permutation testing, due to the efficient implementation of the latter: computing the Eigenvalues of the gram matrix costs $\mathcal{O}(m^3)$.
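The sampling scheme can be sketched in NumPy as follows. This follows the formulas exactly as stated above ($\hat\lambda_l=\frac{1}{m}\nu_l$, $z_l\sim\mathcal{N}(0,2)$) and is not Shogun's code. Normalisation conventions may differ between references and implementations, and the kernel parametrisation and function names are assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(a, b, width=1.0):
    # k(x, y) = exp(-(x - y)^2 / width) for 1D samples (assumed parametrisation)
    return np.exp(-(a[:, None] - b[None, :])**2 / width)

def spectrum_null_samples(x, y, num_samples=1000, num_eigenvalues=20, seed=0):
    rng = np.random.default_rng(seed)
    m = len(x)
    z = np.concatenate([x, y])
    k = gaussian_kernel_matrix(z, z)
    # centre the joint kernel matrix: H K H with H = I - ones / N
    size = len(z)
    h = np.eye(size) - np.ones((size, size)) / size
    centred = h.dot(k).dot(h)
    # largest eigenvalues nu_l of the centred matrix (eigvalsh sorts ascending)
    nu = np.linalg.eigvalsh(centred)[::-1][:num_eigenvalues]
    lambdas = np.maximum(nu, 0.0) / m
    # draw z_l ~ N(0, 2) and form sum_l lambda_l * z_l^2 for every null sample
    normals = rng.normal(0.0, np.sqrt(2.0), size=(num_samples, len(lambdas)))
    return (normals**2).dot(lambdas)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = rng.normal(size=100)

null_samples = spectrum_null_samples(x, y)
threshold = np.quantile(null_samples, 0.95)
print("95%% threshold from spectrum samples: %.2f" % threshold)
```

The Eigen-decomposition is computed once, after which drawing further null samples only costs a matrix-vector product.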

Below, we illustrate how to sample the null distribution and perform two-sample testing with the Spectrum approximation in the class CQuadraticTimeMMD. This method only works with the biased statistic.

In [10]:
# optional: plot spectrum of joint kernel matrix

# TODO: it would be good if there was a way to extract the joint kernel matrix for all kernel tests

# get joint feature object and compute kernel matrix and its spectrum
mmd.get_kernel().init(feats_p_q, feats_p_q)

# visualise K and its spectrum (only up to threshold)
plt.imshow(K, interpolation="nearest")
plt.title("Kernel matrix K of joint data $X$ and $Y$")
title("Eigenspectrum of K until component %d" % len(w[w>thresh]));

The above plot of the Eigenspectrum shows that the Eigenvalues are decaying extremely fast. We choose the number for the approximation such that all Eigenvalues bigger than some threshold are used. In this case, we will not lose a lot of accuracy while gaining a significant speedup. For slower decaying Eigenspectra, this approximation might be more expensive.

In [11]:
# threshold for eigenspectrum

# compute number of eigenvalues to use

# finally, do the test, use biased statistic

#tell Shogun to use spectrum approximation

# the usual test interface
print "Spectrum: P-value of MMD test is %.2f" % p_value_spectrum

# compare with ground truth from permutation test
print "Bootstrapping: P-value of MMD test is %.2f" % p_value_permutation
Spectrum: P-value of MMD test is 0.05
Bootstrapping: P-value of MMD test is 0.05

The Gamma Moment Matching Approximation and Type I errors

$\DeclareMathOperator{\var}{var}$ Another method for approximating the null-distribution is to match the first two moments of a Gamma distribution and then compute its quantiles. This does not result in a consistent test, but usually gives good results while being very fast. However, there are distributions where the method fails. Therefore, the type I error should always be monitored. The method is described in [2]. It uses

$$ m\mmd_b(Z) \sim \frac{x^{\alpha-1}\exp(-\frac{x}{\beta})}{\beta^\alpha \Gamma(\alpha)} $$

where

$$ \alpha=\frac{(\textbf{E}(\text{MMD}_b(Z)))^2}{\var(\text{MMD}_b(Z))} \qquad \text{and} \qquad \beta=\frac{m \var(\text{MMD}_b(Z))}{\textbf{E}(\text{MMD}_b(Z))} $$

Then, any threshold and p-value can be computed using the gamma distribution in the above expression. Computational costs are in $\mathcal{O}(m^2)$. Note that the test is parameter free. It only works with the biased statistic.
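The moment matching step itself is short. The sketch below assumes that estimates of the mean and variance of $\text{MMD}_b$ under $H_0$ are already available. The two numbers used are hypothetical placeholders, not values from this notebook. It uses the shape-scale form of the Gamma density given above, whose mean is $\alpha\beta$ and whose variance is $\alpha\beta^2$, so that both moments of $m\,\text{MMD}_b$ are matched.

```python
import numpy as np

m = 220
mean_mmd = 0.004  # hypothetical estimate of E[MMD_b] under H_0
var_mmd = 4e-6    # hypothetical estimate of var[MMD_b] under H_0

shape = mean_mmd**2 / var_mmd   # alpha
scale = m * var_mmd / mean_mmd  # beta

# a Gamma(shape, scale) variable has mean shape*scale and variance shape*scale^2,
# which match the mean and variance of m * MMD_b
gamma_mean = shape * scale
gamma_var = shape * scale**2

# threshold at test level 0.05, here estimated from samples of the fitted Gamma
rng = np.random.default_rng(0)
samples = rng.gamma(shape, scale, size=100000)
threshold = np.quantile(samples, 0.95)
print("alpha=%.2f, beta=%.3f, threshold=%.2f" % (shape, scale, threshold))
```

Since no resampling of the data is involved, the cost is dominated by estimating the two moments.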

In [12]:
# tell Shogun to use gamma approximation

# the usual test interface
print "Gamma: P-value of MMD test is %.2f" % p_value_gamma

# compare with ground truth bootstrapping
print "Bootstrapping: P-value of MMD test is %.2f" % p_value_permutation
Gamma: P-value of MMD test is 0.35
Bootstrapping: P-value of MMD test is 0.07

As we can see, the above example was kind of unfortunate, as the approximation fails badly. We check the type I error to verify that. This works similarly to sampling the alternative distribution: re-sample data (assuming infinite amounts), perform the test and average results. Below we compare type I errors of all methods for approximating the null distribution. This will take a while.

In [13]:
# type I error is false alarm, therefore sample data under H0
for i in range(num_trials):
    X=norm.rvs(size=n, loc=mu, scale=np.sqrt(sigma2))
    Y=laplace.rvs(size=n, loc=mu, scale=b)
    # simulate H0 via merging samples before computing the 
    feat_p=sg.RealFeatures(reshape(X, (1,len(X))))
    feat_q=sg.RealFeatures(reshape(Y, (1,len(Y))))
    # gamma
    mmd=sg.QuadraticTimeMMD(feat_p, feat_q)
    # spectrum
    mmd=sg.QuadraticTimeMMD(feat_p, feat_q)
    # bootstrap (precompute kernel)
    mmd=sg.QuadraticTimeMMD(feat_p, feat_q)
    kernel.init(p_and_q, p_and_q)
In [14]:

print "Average rejection rate of H0 for Gamma is %.2f" % mean(convergence_gamma)
print "Average rejection rate of H0 for Spectrum is %.2f" % mean(convergence_spectrum)
print "Average rejection rate of H0 for Bootstrapping is %.2f" % mean(rejections_bootstrap)
Average rejection rate of H0 for Gamma is 0.00
Average rejection rate of H0 for Spectrum is 0.58
Average rejection rate of H0 for Bootstrapping is 0.56

We see that Gamma basically never rejects, which is in line with the fact that the p-value was massively overestimated above. Note that for the other tests, the rejection rate is also not at its desired value, but this is due to the low number of samples/repetitions in the above code. Increasing them leads to consistent type I errors.

Linear Time MMD on Gaussian Blobs

So far, we basically had to precompute the kernel matrix for reasonable runtimes. This is not possible for more than a few thousand points. The linear time MMD statistic, implemented in CLinearTimeMMD, can help here, as it accepts data under the streaming interface CStreamingFeatures, which delivers data one-by-one.

And it can do more cool things, for example choose the best single (or combined) kernel for you. But we need a fancier dataset to show its power. We will use one of Shogun's streaming-based data generators, CGaussianBlobsDataGenerator, for that. This dataset consists of two distributions, each a grid of Gaussians, where in one of them the Gaussians are stretched and rotated. This dataset is regarded as challenging for two-sample testing.

In [15]:
# parameters of dataset

# these are streaming features
gen_p=sg.GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)
gen_q=sg.GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)
# stream some data and plot

plot(data[0][0:num_plot], data[1][0:num_plot], 'r.', label='$x$')
title('$X\sim p$')
plot(data[0][num_plot+1:2*num_plot], data[1][num_plot+1:2*num_plot], 'b.', label='$x$', alpha=0.5)
_=title('$Y\sim q$')

We now describe the linear time MMD from [1, Section 6], which is implemented in Shogun. A fast, unbiased estimate for the original MMD expression which still uses all available data can be obtained by dividing the data into two parts and then computing

$$ \mmd_l^2[\mathcal{F},X,Y]=\frac{1}{m_2}\sum_{i=1}^{m_2} k(x_{2i},x_{2i+1})+k(y_{2i},y_{2i+1})-k(x_{2i},y_{2i+1})- k(x_{2i+1},y_{2i}) $$

where $ m_2=\lfloor\frac{m}{2} \rfloor$. While the above expression assumes that $m$ data are available from each distribution, the statistic in general works in an online setting where features are obtained one by one. Since only four points are considered at once, the statistic can be computed on data streams. In addition, the computational costs are linear in the number of samples that are considered from each distribution. These two properties make the linear time MMD well suited for large-scale two-sample tests. In theory, any number of samples can be processed, and time is the only limiting factor.
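The estimator can be sketched in a few lines of NumPy. This is an in-memory illustration, not the streaming implementation in Shogun. The kernel parametrisation $k(x,y)=\exp(-\|x-y\|^2/\text{width})$, the toy data and the function names are assumptions, and consecutive samples are paired using zero-based indexing.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    # row-wise k(a_i, b_i) = exp(-||a_i - b_i||^2 / width) (assumed parametrisation)
    return np.exp(-np.sum((a - b)**2, axis=1) / width)

def linear_time_mmd_terms(x, y, width=1.0):
    # pair up consecutive samples; each term only touches four points
    m2 = min(len(x), len(y)) // 2
    x1, x2 = x[0:2 * m2:2], x[1:2 * m2:2]
    y1, y2 = y[0:2 * m2:2], y[1:2 * m2:2]
    return (gaussian_kernel(x1, x2, width) + gaussian_kernel(y1, y2, width)
            - gaussian_kernel(x1, y2, width) - gaussian_kernel(x2, y1, width))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(2000, 2))
y = rng.normal(1.0, 1.0, size=(2000, 2))  # shifted mean, so p and q differ

terms = linear_time_mmd_terms(x, y)
statistic = terms.mean()
print("MMD_l[X,Y]^2=%.2f" % statistic)
```

No kernel matrix is ever formed: only one kernel value per pair of points is needed, which keeps memory usage constant when the terms are accumulated on the fly.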

We begin by illustrating how to pass data to CLinearTimeMMD. In order not to lose performance due to overhead, it is possible to specify a block size for the data stream.

In [16]:

# if features are already under the streaming interface, just pass them
mmd=sg.LinearTimeMMD(gen_p, gen_q)

# compute an unbiased estimate in linear time
print "MMD_l[X,Y]^2=%.2f" % statistic

# note: due to the streaming nature, successive calls of compute statistic use different data
# and produce different results. Data cannot be stored in memory
for _ in range(5):
    print "MMD_l[X,Y]^2=%.2f" % mmd.compute_statistic()

Sometimes, one might want to use CLinearTimeMMD with data that is stored in memory. In that case, it is easy to convert data in the form of, for example, CStreamingDenseFeatures into CDenseFeatures.

In [17]:
# data source
gen_p=sg.GaussianBlobsDataGenerator(num_blobs, distance, 1, 0)
gen_q=sg.GaussianBlobsDataGenerator(num_blobs, distance, stretch, angle)

print "Number of data is %d" % num_samples

# retrieve some points, store them as non-streaming data in memory

# example to create mmd (note that num_samples can be maximum the number of data in memory)

mmd=sg.LinearTimeMMD(data_p, data_q)
mmd.set_kernel(sg.GaussianKernel(10, 1))
print "Linear time MMD statistic: %.2f" % mmd.compute_statistic()
Number of data is 100
Linear time MMD statistic: 0.05

The Gaussian Approximation to the Null Distribution

As for any two-sample test in Shogun, bootstrapping can be used to approximate the null distribution. This results in a consistent, but slow test. The number of samples to take is the only parameter. Note that since CLinearTimeMMD operates on streaming features, new data is taken from the stream in every iteration.

Bootstrapping is not really necessary since there exists a fast and consistent estimate of the null-distribution. However, to ensure that any approximation is accurate, it should always be checked against bootstrapping at least once.

Since both the null- and the alternative distribution of the linear time MMD are Gaussian with equal variance (and different mean), it is possible to approximate the null-distribution by using a linear time estimate for this variance. An unbiased, linear time estimator for

$$ \var[\mmd_l^2[\mathcal{F},X,Y]] $$

can be obtained from the empirical variance of

$$ k(x_{2i},x_{2i+1})+k(y_{2i},y_{2i+1})-k(x_{2i},y_{2i+1})-k(x_{2i+1},y_{2i}) \qquad (1\leq i\leq m_2) $$

Since the statistic is the average of these $m_2$ terms, its variance is their empirical variance divided by $m_2$. A normal distribution with that variance and zero mean can then be used as an approximation for the null-distribution. This results in a consistent test and is very fast. However, note that it is an approximation and its accuracy depends on the underlying data distributions. It is a good idea to compare to the bootstrapping approach first to determine an appropriate number of samples to use. This number is usually in the tens of thousands.
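A sketch of the resulting test in NumPy is given below. It is not Shogun's implementation: the data, kernel parametrisation and function names are assumptions. The empirical variance of the terms is divided by $m_2$ because the statistic is their average, and the p-value is the upper tail of a zero-mean normal distribution with that variance.

```python
import math
import numpy as np

def h_terms(x, y, width=1.0):
    # terms k(x,x') + k(y,y') - k(x,y') - k(x',y) for consecutive pairs of 1D samples
    def k(a, b):
        return np.exp(-(a - b)**2 / width)
    m2 = min(len(x), len(y)) // 2
    x1, x2 = x[0:2 * m2:2], x[1:2 * m2:2]
    y1, y2 = y[0:2 * m2:2], y[1:2 * m2:2]
    return k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=4000)
y = rng.normal(1.0, 1.0, size=4000)  # shifted mean, so H_0 is false here

terms = h_terms(x, y)
statistic = terms.mean()
# variance of the average of m_2 terms
variance = terms.var(ddof=1) / len(terms)

# p-value under the N(0, variance) null approximation: P(Z >= statistic)
p_value = 0.5 * math.erfc(statistic / math.sqrt(2.0 * variance))
alpha = 0.05
reject = p_value < alpha
print("statistic %.3f, p-value %.4f, reject H0: %s" % (statistic, p_value, reject))
```

Statistic and variance are computed from the same terms, so a single pass over the data suffices.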

CLinearTimeMMD can approximate the null distribution in the same pass as computing the statistic itself (in linear time). This should always be used in practice since separate calls for computing the statistic and the p-value will operate on different data from the stream. Below, we compute the test on a large amount of data (it is impossible to perform the quadratic time MMD here, as the kernel matrices cannot be stored in memory).

In [18]:
mmd=sg.LinearTimeMMD(gen_p, gen_q)

print "m=%d samples from p and q" % m
print "Binary test result is: " + ("Rejection" if mmd.perform_test(alpha) else "No rejection")
print "P-value test result is %.2f" % mmd.compute_p_value(mmd.compute_statistic())
m=20000 samples from p and q
Binary test result is: Rejection
P-value test result is 0.00

Kernel Selection for the MMD -- Overview

$\DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max}$ Now which kernel do we actually use for our tests? So far, we just plugged in arbitrary ones. However, for kernel two-sample testing, it is possible to do something more clever.

Shogun's kernel selection methods for MMD based two-sample tests are all based on [3, 4]. For the CLinearTimeMMD, [3] describes a way of selecting the optimal kernel in the sense that the test's type II error is minimised. For the linear time MMD, this is the method of choice. It is done via maximising the MMD statistic divided by its standard deviation and it is possible for single kernels and also for convex combinations of them. For the CQuadraticTimeMMD, the best method in the literature is choosing the kernel that maximises the MMD statistic [4]. For convex combinations of kernels, this can be achieved via an $L2$ norm constraint. A detailed comparison of all methods on numerous datasets can be found in [5].

MMD kernel selection in Shogun always involves choosing one of the methods listed below and a set of baseline kernels, for example CGaussianKernel instances of different widths. All methods compute their results for a fixed set of these baseline kernels. We later give an example of how to use these methods after providing a list of the available ones.

  • KSM_MEDIAN_HEURISTIC: Selects from a set of CGaussianKernel instances the one whose width parameter is closest to the median of the pairwise distances in the data. The median is computed on a certain number of points from each distribution that can be specified as a parameter. Since the median is a stable statistic, one does not have to compute all pairwise distances but rather just a few thousand. This method is a useful (and fast) heuristic that in many cases gives a good hint on where to start looking for Gaussian kernel widths. It is for example described in [1]. Note that it may fail badly in selecting a good kernel for certain problems.

  • KSM_MAXIMIZE_MMD: Selects from a set of arbitrary baseline kernels a single one that maximises the used MMD statistic, or more specifically its estimate. $$ k^*=\argmax_{k\in\mathcal{K}} \hat \eta_k, $$ where $\hat\eta_k$ is an empirical MMD estimate for using a kernel $k$. This was first described in [4] and was empirically shown to perform better than the median heuristic above. However, it remains a heuristic that comes with no guarantees. Since MMD estimates can be computed in linear and quadratic time, this method works for both statistics. However, for the linear time statistic, there exists a better method.

  • KSM_MAXIMIZE_POWER: Selects the optimal single kernel from a set of baseline kernels. This is done via maximising the ratio of the linear MMD statistic and its standard deviation. $$ k^*=\argmax_{k\in\mathcal{K}} \frac{\hat \eta_k}{\hat\sigma_k+\lambda}, $$ where $\eta_k$ is a linear time MMD estimate for using a kernel $k$ and $\hat\sigma_k$ is a linear time variance estimate of $\eta_k$ to which a small number $\lambda$ is added to prevent division by zero. These are estimated in a linear time way with the streaming framework that was described earlier. Therefore, this method is only available for CLinearTimeMMD. Optimal here means that the resulting test's type II error is minimised for a fixed type I error. Important: For this method to work, the kernel needs to be selected on different data than the test is performed on. Otherwise, the method will produce wrong results.

  • CMMDKernelSelectionCombMaxL2: Selects a convex combination of kernels that maximises the MMD statistic. This is the multiple kernel analogue of CMMDKernelSelectionMax. This is done via solving the convex program $$ \boldsymbol{\beta}^*=\argmin_{\boldsymbol{\beta}} \{\boldsymbol{\beta}^T\boldsymbol{\beta} : \boldsymbol{\beta}^T\boldsymbol{\eta}=1, \boldsymbol{\beta}\succeq 0\}, $$ where $\boldsymbol{\beta}$ is a vector of the resulting kernel weights and $\boldsymbol{\eta}$ is a vector of which each component contains an MMD estimate for a baseline kernel. See [3] for details. Note that this method is unable to select a single kernel, even when this would be optimal. Again, when using the linear time MMD, there are better methods available.

  • CMMDKernelSelectionCombOpt: Selects a convex combination of kernels that maximises the MMD statistic divided by its covariance. This corresponds to optimal kernel selection in the same sense as in class CMMDKernelSelectionOpt and is its multiple kernel analogue. The convex program to solve is $$ \boldsymbol{\beta}^*=\argmin_{\boldsymbol{\beta}} \{\boldsymbol{\beta}^T(\hat Q+\lambda I)\boldsymbol{\beta} : \boldsymbol{\beta}^T\boldsymbol{\eta}=1, \boldsymbol{\beta}\succeq 0\}, $$ where again $\boldsymbol{\beta}$ is a vector of the resulting kernel weights and $\boldsymbol{\eta}$ is a vector of which each component contains an MMD estimate for a baseline kernel. The matrix $\hat Q$ is a linear time estimate of the covariance matrix of the vector $\boldsymbol{\eta}$ to whose diagonal a small number $\lambda$ is added to prevent division by zero. See [3] for details. In contrast to CMMDKernelSelectionCombMaxL2, this method is able to select a single kernel when this gives a lower type II error than a combination. In this sense, it contains CMMDKernelSelectionOpt.
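The single kernel criterion that maximises the ratio of the linear time statistic and its standard deviation can be sketched in NumPy as follows. This is an illustration of the criterion only, not of Shogun's kernel selection classes. The toy data, the kernel parametrisation, the value of `lam` and the function names are assumptions, and $\hat\sigma_k$ is taken to be the empirical standard deviation of the individual terms.

```python
import numpy as np

def h_terms(x, y, width):
    # linear time MMD terms for consecutive pairs of 1D samples
    def k(a, b):
        return np.exp(-(a - b)**2 / width)
    m2 = min(len(x), len(y)) // 2
    x1, x2 = x[0:2 * m2:2], x[1:2 * m2:2]
    y1, y2 = y[0:2 * m2:2], y[1:2 * m2:2]
    return k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)

def select_width(x, y, widths, lam=1e-5):
    # ratio of the MMD estimate and its standard deviation for every candidate
    ratios = []
    for width in widths:
        terms = h_terms(x, y, width)
        ratios.append(terms.mean() / (terms.std(ddof=1) + lam))
    ratios = np.array(ratios)
    return widths[int(np.argmax(ratios))], ratios

rng = np.random.default_rng(0)
# data used for kernel selection only; the test itself should use fresh data
x = rng.normal(0.0, 1.0, size=4000)
y = rng.normal(1.0, 1.0, size=4000)

widths = [2.0**e for e in range(-5, 6)]
best_width, ratios = select_width(x, y, widths)
print("Best single kernel has width %.2f" % best_width)
```

Dropping the division by the standard deviation turns the same loop into the criterion that simply maximises the MMD estimate.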

MMD Kernel Selection in Shogun

In order to use one of the above methods for kernel selection, one has to create a new instance of CCombinedKernel and append all desired baseline kernels to it. This combined kernel is then passed to the MMD class. Then, an object of any of the above kernel selection methods is created and the MMD instance is passed to it in the constructor. There are then multiple methods to call:

  • compute_measures to compute a vector of kernel selection criteria if a single kernel selection method is used. It will return a vector of selected kernel weights if a combined kernel selection method is used. For CMMDKernelSelectionMedian, the method throws an error.

  • select_kernel returns the selected kernel of the method. For single kernels this will be one of the baseline kernel instances. For the combined kernel case, this will be the underlying CCombinedKernel instance where the subkernel weights are set to the weights that were selected by the method.

In order to utilise the selected kernel, it has to be passed to an MMD instance. We now give an example how to select the optimal single and combined kernel for the Gaussian Blobs dataset.

What is the best kernel to use here? This is tricky since the distinguishing characteristics are hidden at a small length-scale. Create some kernels to select the best from

In [19]:
# mmd instance using streaming features
mmd=sg.LinearTimeMMD(gen_p, gen_q)

sigmas=[2**x for x in np.linspace(-5, 5, 11)]
print "Choosing kernel width from", ["{0:.2f}".format(sigma) for sigma in sigmas]

for i in range(len(sigmas)):
    mmd.add_kernel(sg.GaussianKernel(10, sigmas[i]))

# optimal kernel choice is possible for linear time MMD

# must be set true for kernel selection

# select best kernel

print "Best single kernel has bandwidth %.2f" % best_kernel.get_width()
Choosing kernel width from ['0.03', '0.06', '0.12', '0.25', '0.50', '1.00', '2.00', '4.00', '8.00', '16.00', '32.00']
Best single kernel has bandwidth 32.00

Now perform two-sample test with that kernel

In [20]:

print "Bootstrapping: P-value of MMD test with optimal kernel is %.2f" % p_value_best
Bootstrapping: P-value of MMD test with optimal kernel is 0.12

For the linear time MMD, the null and alternative distributions look different than for the quadratic time MMD as plotted above. Let's sample them (this takes longer, so we reduce the number of samples a bit). Note how we can tell the linear time MMD to simulate the null hypothesis, which is necessary since we cannot permute by hand as samples are not in memory.

In [21]:

# sample null and alternative distribution, implicitly generate new data for that

for i in range(num_samples):

And visualise again. Note that both null and alternative distribution are Gaussian, which allows the fast null distribution approximation and the optimal kernel selection

In [22]:
plot_alt_vs_null(alt_samples, null_samples, alpha)

Soon to come:

  • Two-sample tests on strings
  • Two-sample tests on audio data (quite fun)
  • Testing for independence with the Hilbert Schmidt Independence Criterion


[1]: Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13, 671–721.

[2]: Gretton, A., Fukumizu, K., Harchaoui, Z., & Sriperumbudur, B. K. (2012). A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems (pp. 673–681).

[3]: Gretton, A., Sriperumbudur, B., Sejdinovic, D., Strathmann, H., Balakrishnan, S., Pontil, M., & Fukumizu, K. (2012). Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems.

[4]: Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G. R. G., & Schölkopf, B. (2009). Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems.

[5]: Strathmann, H. (2012). Adaptive Large-Scale Kernel Two-Sample Testing. M.Sc. thesis, University College London.
