Variational Inference for Gaussian Process Classification Models

By Wu Lin - yorker.lin@gmail.com - https://github.com/yorkerlin Suggested by Heiko Strathmann and Mohammad Emtiyaz Khan

Based on the notebook on Gaussian Processes by Heiko Strathmann - heiko.strathmann@gmail.com - github.com/karlnapf - herrstrathmann.de, the GP framework of the Google Summer of Code 2014 project of Wu Lin, the Google Summer of Code 2013 project of Roman Votyakov - votjakovr@gmail.com - github.com/votjakovr, and the Google Summer of Code 2012 project of Jacob Walker - walke434@gmail.com - github.com/puffin444

A Gaussian process (GP) model is a flexible Bayesian non-parametric model based on kernel methods; it can be used for regression, classification, dimensionality reduction, and deep learning. This notebook describes how to perform variational inference for Gaussian Process Classification (GPC) models in Shogun. In a GPC model, a set of GP functions is used to predict the discrete label of a data point given its features.

For the theory of GPC, we assume that readers have some background in Bayesian statistics and basic knowledge of Gaussian processes; for this background, please see the notebook or another notebook about Gaussian Processes. After a semi-formal introduction, we illustrate how to train GPC models, make predictions, and automatically learn their parameters (model selection) in Shogun.

In [1]:
%matplotlib inline
# import all shogun classes
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
from modshogun import *

# import all required libraries
import scipy
import scipy.io
import numpy as np
from math import exp,sqrt,log
import random
import time
import matplotlib.pyplot as plt
import matplotlib.cm as cm
Highlights of the GP framework in Shogun include:

  • Unified API across 10 major languages (Python, R, Octave, Java, and more)
  • Better performance (written in C++ with parallel computing) compared to scikit-learn and GPML (TODO: add some speedup statistics)
  • More medium-scale variational inference methods (6 methods) compared to scikit-learn and GPML
  • More large-scale variational inference methods (8 methods, including 5 ongoing) compared to scikit-learn and GPML
  • Efficient implementation of more than 25 kernels (covariance functions)
  • Supporting GP multi-classification
  • Inference methods for Standard GP models

    • Exact method for GP regression models
    • Variational methods for GP classification models
  • Inference methods for sparse GP models (also known as large scale inference methods)

    • Fully Independent Training Conditional (FITC) methods for GP regression models
    • FITC Laplace methods for GP classification models
    • FITC KL methods for GP regression and classification models
    • Parallel/Distributed Variational Inference for GP regression and classification models (Ongoing)
    • Stochastic Variational Inference for GP regression and classification models (Ongoing)
    • Nonparametric Variational Inference for GP regression and classification models (TODO)
  • Beyond GP models for regression and classification

    • GP latent variable models for dimension reduction (Ongoing)
    • Deep GP models for deep learning (TODO)

In this notebook, we mainly focus on binary classification problems. It is not hard to extend the presented methodology to multi-class classification problems. Given a binary classification problem, binary labels and features are respectively defined as a column vector, $\mathbf{y}\in\mathcal{Y}^n=\{-1,+1\}^n$, and a matrix, $\mathbf{X}\in\mathcal{R}^{n\times d}$, where $n$ is the number of data points and $d$ is the number of features. A likelihood function $p(\mathbf{y}|\text{f},\mathbf{X})=p(\mathbf{y}|\mathbf{f})$ is used to model the labeled data, where a latent function $\text{f}:\mathcal{R}^{d}\to \mathcal{R}$ is drawn from a GP prior and $\text{f}(\mathbf{X})=\mathbf{f} \in \mathcal{R}^{n}$ is the column vector obtained by applying $\text{f}$ to each row of $\mathbf{X}$. Given labeled data, our goal is to train a GP classifier ($p(\text{f}|\mathbf{y},\mathbf{X})$) that fits the data well and predicts new data points ($p(y_\text{new}|\mathbf{y},\mathbf{X},\mathbf{x}_\text{new})$). The following sections cover the detailed background of variational inference for GPC models.

Note that the difference between $\text{f}$ and $\mathbf{f}$ is that $\text{f}$ is a function while $\mathbf{f}$ is an n-dimensional vector. The GP prior, $p(\text{f})$, and $p(\mathbf{f})=p(\text{f}(\mathbf{X}))$ are also different, because $p(\text{f})$ is a distribution in an infinite dimensional (function) space while $p(\mathbf{f})$ is a distribution in a finite dimensional real space.

To gain some intuition about how these GPC models behave, and how well variational Gaussian inference methods approximate posteriors, please see the section A Toy Example of Variational Gaussian Inference and the section An Example for Visualization.

We want to learn the posterior, $p(\text{f}|\mathbf{y},\mathbf{X}) \propto p(\mathbf{y}| \text{f},\mathbf{X}) \times p(\text{f}|\mathbf{X}) $, given the training data points. Note that $p(\text{f})$ is the prior over GP functions. We will see in the next section that all we need is to learn $p(\mathbf{f}|\mathbf{y})=p(\text{f}(\mathbf{X})|\mathbf{y},\mathbf{X})$, which is a distribution in a finite dimensional space. (Recall that we write $\mathbf{f}$ for $\text{f}(\mathbf{X})$ and $p(\mathbf{f}|\mathbf{y})$ for $p(\text{f}(\mathbf{X})|\mathbf{y},\mathbf{X})$ for short.)

The key intuition of variational inference is to approximate the distribution of interest (here $p(\mathbf{f}|\mathbf{y})$, which is a non-Gaussian distribution) with a more tractable distribution (for variational Gaussian inference, a Gaussian $q(\mathbf{f}|\mathbf{y})$), via minimizing the Kullback–Leibler divergence (KL divergence)

$${\mathrm{KL}}(Q\|P) = \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{p(\mathbf{f}|\mathbf{y})}\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f}$$

between the true distribution and the approximation.
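For intuition, when both $Q$ and $P$ happen to be Gaussian the KL divergence has a closed form; the NumPy sketch below (an illustration only, not code used by Shogun) evaluates it. For the non-Gaussian posteriors arising in GPC no such closed form exists, which is why an approximation is needed.

import numpy as np

def kl_gaussians(mu_q, Sigma_q, mu_p, Sigma_p):
    """Closed-form KL(Q||P) between two multivariate Gaussians Q and P."""
    d = len(mu_q)
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    diff = mu_p - mu_q
    trace_term = np.trace(Sigma_p_inv.dot(Sigma_q))
    quad_term = diff.dot(Sigma_p_inv).dot(diff)
    logdet_term = np.linalg.slogdet(Sigma_p)[1] - np.linalg.slogdet(Sigma_q)[1]
    return 0.5 * (trace_term + quad_term - d + logdet_term)

# sanity check: the divergence vanishes when Q and P coincide
print(kl_gaussians(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2)))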

Please see the next section for the reason why making predictions with GPC is not easy when $p(\mathbf{f}|\mathbf{y})$ is non-Gaussian. This is why we want to use a more tractable distribution $q(\mathbf{f}|\mathbf{y})$ to perform prediction.

Throughout this notebook, we deal with binary classification problems using the inverse-logit (also known as Bernoulli-logistic) likelihood to model labels, given by

$p(\mathbf{y}|\mathbf{f})=\prod_{i=1}^n p(y_\text{i}|\text{f}(\mathbf{X}_\text{i}))=\prod_{i=1}^n \frac{1}{1+\exp(-y_\text{i} \mathbf{f}_\text{i})}$, where $\mathbf{X}_\text{i}$ is the i-th row of $\mathbf{X}$ and $\text{f}(\mathbf{X}_\text{i})=\mathbf{f}_\text{i}$ is the i-th element of $\mathbf{f}$.

Note that in GPC models we implicitly assume that the labels $\mathbf{y}$ are conditionally independent given the (latent) values $\mathbf{f}$.
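For reference, the following is a minimal NumPy sketch (not part of the original notebook) of this log-likelihood, $\ln p(\mathbf{y}|\mathbf{f})=-\sum_{i=1}^n \ln(1+\exp(-y_\text{i} \mathbf{f}_\text{i}))$, computed in a numerically stable way:

import numpy as np

def logit_log_likelihood(y, f):
    """log p(y|f) = -sum_i log(1 + exp(-y_i * f_i)) for labels y in {-1,+1}."""
    # np.logaddexp(0, -y*f) computes log(1 + exp(-y*f)) without overflow
    return -np.sum(np.logaddexp(0.0, -y * f))

y = np.array([1.0, -1.0, 1.0])
f = np.array([2.0, -0.5, 0.1])
print(logit_log_likelihood(y, f))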

In GPC models, our goal is to predict the label, $y_\text{new}$, of a new data point, a row vector $\mathbf{x}_\text{new}\in\mathcal{R}^{d}$, based on $p(y_\text{new}|\mathbf{y},\mathbf{X},\mathbf{x}_\text{new})$. According to Bayes' theorem, the posterior predictive distribution is $$p(y_\text{new}|\mathbf{y},\mathbf{X},\mathbf{x}_\text{new})= \int {p(y_\text{new}|\text{f},\mathbf{x}_\text{new})p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}$$ Informally, by the marginalization property of GPs, we have:

$$\int {p(y_\text{new}|\text{f},\mathbf{x}_\text{new})p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}= \int {p(y_\text{new}|\text{f}(\mathbf{x}_\text{new}),\mathbf{x}_\text{new})p(\text{f}(\mathbf{x}_\text{new})|\mathbf{y},\mathbf{X}) d\text{f}(\mathbf{x}_\text{new})} = \int {p(y_\text{new}|\mathbf{f}_\text{new})p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X}) d\mathbf{f}_\text{new}} $$

where $\text{f}(\mathbf{x}_\text{new})=\mathbf{f}_\text{new}$ and $p(y_\text{new}|\mathbf{f}_\text{new})=p(y_\text{new}|\text{f},\mathbf{x}_\text{new})$.

The key difference here is that $p(\text{f}|\mathbf{y}, \mathbf{X})$ is a distribution in an infinite dimensional space while $p(\mathbf{f}_{new}|\mathbf{y},\mathbf{X})$ is a distribution in a one-dimensional space.

Note that $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})=\int {p(\mathbf{f}_\text{new}|\text{f}) p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}$.

Similarly, $$\int {p(\mathbf{f}_\text{new}|\text{f}) p(\text{f}|\mathbf{y},\mathbf{X}) d\text{f}}=\int {p(\mathbf{f}_\text{new}|\text{f}(\mathbf{X})) p(\text{f}(\mathbf{X})|\mathbf{y},\mathbf{X}) d\text{f}(\mathbf{X}) } = \int {p(\mathbf{f}_\text{new}|\mathbf{f}) p(\mathbf{f}|\mathbf{y}) d\mathbf{f}} $$.

Recall that $p(\mathbf{f}|\mathbf{y})=p(\text{f}(\mathbf{X})|\mathbf{y}, \mathbf{X})$ is a distribution in an n-dimensional space while $p(\text{f}|\mathbf{y},\mathbf{X})$ is a distribution in an infinite dimensional space.

Informally, according to the GP prior, the following holds:

$p(\mathbf{f}_\text{new}|\mathbf{f})=\frac{p(\mathbf{f}_\text{new},\mathbf{f})}{p(\mathbf{f})}= \frac{p(\text{f}(\mathbf{ \left[ \begin{array}{c} \mathbf{X} \\ \hdashline \mathbf{x}_\text{new} \end{array} \right] }))}{p(\text{f}(\mathbf{\mathbf{X}}))}$, where $\mathbf{ \left[ \begin{array}{c} \mathbf{X} \\ \hdashline \mathbf{x}_\text{new} \end{array} \right] } \in \mathcal{R}^{(n+1)\times d}$ and $\text{f}(\mathbf{ \left[ \begin{array}{c} \mathbf{X} \\ \hdashline \mathbf{x}_\text{new} \end{array} \right] }) \in \mathcal{R}^{n+1} $
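Concretely, since the GP prior makes the joint vector $\text{f}(\left[ \begin{array}{c} \mathbf{X} \\ \mathbf{x}_\text{new} \end{array} \right])$ Gaussian, $p(\mathbf{f}_\text{new}|\mathbf{f})$ is an ordinary Gaussian conditional. Below is a minimal NumPy sketch (not part of the original notebook) that computes its mean and variance from the kernel blocks, assuming a zero-mean prior; the toy squared-exponential kernel is defined here purely for illustration:

import numpy as np

def toy_rbf_kernel(A, B, log_sigma=1.0):
    """A toy squared-exponential kernel on the rows of A and B (illustration only)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A.dot(B.T)
    return np.exp(-sq / (2.0 * np.exp(2.0 * log_sigma)))

def conditional_gaussian(X, x_new, f, kernel):
    """p(f_new | f) = N(k_*^T K^{-1} f, k_** - k_*^T K^{-1} k_*) under a zero-mean GP prior."""
    K = kernel(X, X) + 1e-8 * np.eye(len(X))  # small jitter for numerical stability
    k_star = kernel(X, x_new)                 # cross-covariance, shape (n, 1)
    k_ss = kernel(x_new, x_new)               # prior variance at x_new, shape (1, 1)
    mean = k_star.T.dot(np.linalg.solve(K, f))
    var = k_ss - k_star.T.dot(np.linalg.solve(K, k_star))
    return mean, var

X = np.array([[np.sqrt(2.0)], [-np.sqrt(2.0)]])  # the toy inputs used later in this notebook
f = np.array([1.0, -1.0])                        # a hypothetical draw of latent values
print(conditional_gaussian(X, np.array([[0.5]]), f, toy_rbf_kernel))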

Note that if $p(\mathbf{f}|\mathbf{y})$ is not a Gaussian distribution, $p(\mathbf{f}_\text{new}|\mathbf{y},\mathbf{X})$ usually does not have a closed form. Unfortunately, in classification problems $p(\mathbf{f}|\mathbf{y})$ is usually non-Gaussian, since $p(\mathbf{f}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{f})p(\mathbf{f})$, where the prior $p(\mathbf{f})$ is a Gaussian distribution but the likelihood $p(\mathbf{y}|\mathbf{f})$, which models discrete labels, is not.

Variational Gaussian inference in GPC approximates $p(\mathbf{f}|\mathbf{y})$ with a Gaussian distribution, $q(\mathbf{f}|\mathbf{y})$, by minimizing the KL divergence ${\mathrm{KL}}(Q\|P)$. Readers may note that the KL divergence is asymmetric: if we minimize ${\mathrm{KL}}(P\|Q)$ instead of ${\mathrm{KL}}(Q\|P)$, the inference method is called expectation propagation (EP).

As mentioned in section Big Picture of Variational Inference for GPC Models, variational inference in GPC models is about minimizing the KL divergence given below:

${\mathrm{KL}}(Q\|P) = \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{p(\mathbf{f}|\mathbf{y})}\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f} = \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{ \frac{p(\mathbf{y}|\mathbf{f})p(\mathbf{f})}{p(\mathbf{y})} }\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f}=\int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f}) }\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f} + \text{const}$, where $\text{const} = \ln(p(\mathbf{y}))$ does not depend on $q$.

Another way to explain variational inference in GPC is to maximize a lower (variational) bound of the log of marginal likelihood, $\ln (p(\mathbf{y}|\mathbf{X})) $.

$\ln (p(\mathbf{y}|\mathbf{X})) = \ln (\int_{-\infty}^\infty {p(\mathbf{y}|\text{f},\mathbf{X})p(\text{f}|\mathbf{X})} d\text{f}) = \ln (\int_{-\infty}^\infty {p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} d\mathbf{f})= \ln (\int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})}} d\mathbf{f}) \geq \int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \ln \left( \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})} \right)} d\mathbf{f}$

where the inequality follows from Jensen's inequality and $\int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \ln \left( \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})} \right)} d\mathbf{f}$ is a lower (variational) bound.

$\int_{-\infty}^\infty { q(\mathbf{f}|\mathbf{y}) \ln ( \frac{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f})} {q(\mathbf{f}|\mathbf{y})}} ) d\mathbf{f}=- \int_{-\infty}^\infty \ln\left(\frac{q(\mathbf{f}|\mathbf{y})}{ p(\mathbf{y}|\mathbf{f})p(\mathbf{f}) }\right) q(\mathbf{f}|\mathbf{y}) \ {\rm d}\mathbf{f} = -E_q[\ln(q(\mathbf{f}|\mathbf{y}))] + E_q[\ln(p(\mathbf{f}))] + E_q[\ln(p(\mathbf{y}|\mathbf{f}))]$,

where $E_q(\cdot)$ denotes the expectation with respect to $q(\mathbf{f}|\mathbf{y})$.
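To make this decomposition concrete, here is a small NumPy sketch (illustrative only, not Shogun's implementation) that estimates the three terms by simple Monte Carlo for a Gaussian $q$, a zero-mean Gaussian prior, and the logit likelihood used in this notebook:

import numpy as np

def mc_lower_bound(mu_q, Sigma_q, Sigma_prior, y, n_samples=10000):
    """Monte Carlo estimate of -E_q[ln q(f)] + E_q[ln p(f)] + E_q[ln p(y|f)]."""
    n = len(mu_q)
    F = np.random.multivariate_normal(mu_q, Sigma_q, n_samples)  # samples from q
    # -E_q[ln q(f)] is the entropy of q (closed form for a Gaussian)
    entropy = 0.5 * (n * np.log(2.0 * np.pi * np.e) + np.linalg.slogdet(Sigma_q)[1])
    # E_q[ln p(f)] for a zero-mean Gaussian prior, estimated from the samples
    Sinv = np.linalg.inv(Sigma_prior)
    log_prior = -0.5 * (n * np.log(2.0 * np.pi) + np.linalg.slogdet(Sigma_prior)[1]
                        + np.einsum('ij,jk,ik->i', F, Sinv, F))
    # E_q[ln p(y|f)] for the Bernoulli-logistic likelihood
    log_lik = -np.sum(np.logaddexp(0.0, -y * F), axis=1)
    return entropy + log_prior.mean() + log_lik.mean()

y = np.array([1.0, -1.0])
Sigma_prior = np.array([[2.0, 0.5], [0.5, 2.0]])
print(mc_lower_bound(np.zeros(2), np.eye(2), Sigma_prior, y))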

Note that in general variational inference, $q(\mathbf{f}|\mathbf{y})$ can be a free-form distribution learnt by minimizing the KL divergence. In variational Gaussian inference, $q(\mathbf{f}|\mathbf{y})$ is a Gaussian distribution whose parameters are estimated by minimizing the KL divergence.

In variational Gaussian inference for GPC, the last term, $E_q[\ln(p(\mathbf{y}|\mathbf{f}))]$, usually does not have a closed form. Recall that $p(\mathbf{y}|\mathbf{f})$ is usually a non-Gaussian distribution.

In variational Gaussian inference, we can show that $E_q[\ln(p(\mathbf{y}|\mathbf{f}))]$ is a sum of one-dimensional integrals, thanks to the marginalization property of the multivariate Gaussian distribution. Recall that the labels ($\mathbf{y}$) are conditionally independent given $\mathbf{f}$. We obtain the following equations:

\begin{equation} \begin{array}{ll} E_q[\ln(p(\mathbf{y}|\mathbf{f}))] &\\ =E_q[\sum_{i=1}^n {\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))}] & \text{(conditional independence)} \\ =\sum_{i=1}^n {E_q[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]} & \text{(linearity of expectation)} \\ =\sum_{i=1}^n {E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]} & \text{(marginalization property)} \\ \end{array} \end{equation}

where $q$ denotes $q(\mathbf{f}|\mathbf{y})$, a multivariate Gaussian $N(\mu, \Sigma)$; $q_\text{i}$ denotes $q_\text{i}(\mathbf{f}_\text{i}|\mathbf{y}_\text{i})$, a univariate Gaussian $N(\mu_\text{i}, \Sigma_\text{i,i})$; and $\mu_\text{i}$ and $\Sigma_\text{i,i}$ are the i-th element of the mean $\mu$ and the i-th diagonal element of the covariance matrix $\Sigma$ of $q(\mathbf{f}|\mathbf{y})$, respectively.

Of course, $E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]$ usually does not have a closed form in GPC. However, this one-dimensional problem is relatively easy to deal with.
One way to deal with it is numerical integration or Monte Carlo methods. Another way is to use a variational lower bound of $E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]$. For now, we assume this expectation is given; we will discuss it later in section Some Background of Variational Bounds in GPC Models.
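For example, since each term is a one-dimensional Gaussian expectation, it can be approximated with Gauss-Hermite quadrature. The following NumPy sketch (an illustration, not the code Shogun uses internally) computes $E_{q_\text{i}}[\ln(p(\mathbf{y}_\text{i}|\mathbf{f}_\text{i}))]$ for the logit likelihood:

import numpy as np

def expected_log_logit(y_i, mu_i, sigma2_i, degree=20):
    """Gauss-Hermite estimate of E_{N(mu_i, sigma2_i)}[ln p(y_i|f)] for the logit likelihood."""
    nodes, weights = np.polynomial.hermite.hermgauss(degree)
    # change of variables f = mu_i + sqrt(2*sigma2_i)*t turns the Gaussian expectation
    # into sum_k w_k * g(f_k) / sqrt(pi)
    f = mu_i + np.sqrt(2.0 * sigma2_i) * nodes
    g = -np.logaddexp(0.0, -y_i * f)  # ln p(y_i|f) = -ln(1 + exp(-y_i f))
    return np.dot(weights, g) / np.sqrt(np.pi)

print(expected_log_logit(1.0, 0.5, 2.0))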

The inference methods for GPC in Shogun and the idea of approximation behind each:

  • Laplace Method: uses a second-order Taylor expansion at the mode of the true posterior
  • Covariance Variational Method: minimizes KL(approximated distribution || true posterior), where the approximating Gaussian distribution has a full covariance matrix
  • Mean-field Variational Method: minimizes KL(approximated distribution || true posterior), where the approximating Gaussian distribution has a diagonal covariance matrix
  • Cholesky Variational Method: minimizes KL(approximated distribution || true posterior), where the approximating Gaussian distribution has a full covariance matrix in Cholesky representation
  • Dual Variational Method: minimizes KL(approximated distribution || true posterior) using variational Gaussian inference in a dual form
  • Expectation Propagation Method: minimizes KL(true posterior || approximated distribution), where the approximating distribution is a Gaussian distribution

Now, we present a 2-D example of GP binary classification using variational Gaussian inference. The Gaussian prior and the likelihood are shown below. For this simple example, we can numerically compute the unnormalized true posterior, shown in the following figure, using Bayes' rule: $p(\mathbf{f}|\mathbf{y}) \propto p(\mathbf{y}|\mathbf{f})p(\mathbf{f})$, where $p(\mathbf{f})$ is the prior and $p(\mathbf{y}|\mathbf{f})$ is the likelihood. The approximate posteriors obtained using various methods are also shown. The figures can be reproduced with the following code. Note that this example is closely related to the example on pages 7-8 of the paper Approximations for binary Gaussian process classification.

In [2]:
def Gaussian_points(Sigma,mu,xmin,xmax,ymin,ymax,delta=0.01):
    """
    This function is used to evaluate the likelihood of a Gaussian distribution on mesh.
    
    Parameters:
         Sigma - covariance of Gaussian
         mu - mean of Gaussian
         xmin, xmax, ymin, ymax, delta - defines mesh
         
    Returns:
    X, Y, Z, where Z = log(p(X,Y)) and p is a bivariate gaussian (mu, Sigma) evaluated at a mesh X,Y
    """

    xlist = np.arange(xmin, xmax, delta)
    ylist = np.arange(ymin, ymax, delta)
    X, Y = np.meshgrid(xlist, ylist)
    model = GaussianDistribution(mu, Sigma)
    Z = np.zeros(len(X)*len(Y))
    idx = 0
    for items in zip(X,Y):
        for sample in zip(items[0],items[1]):
            sample = np.asarray(sample)
            Z[idx] = model.log_pdf(sample)
            idx += 1
    Z = np.asarray(Z).reshape(X.shape)  # match the mesh shape (works for non-square grids too)
    return (X,Y,Z)
In [3]:
def likelihood_points(X,Y,labels,likelihood):
    """
    This function is used to evaluate a given likelihood function on a given mesh of points
    
    Parameters:
        X, Y - coordinates of the mesh
        labels - labels for the likelihood
        likelihood - likelihood function
        
    Returns:
        Z - log(p(X,Y,labels)), where p comes from likelihood
    """
    
    Z = np.zeros(len(X)*len(Y))    
    idx = 0
    for items in zip(X,Y):
        for sample in zip(items[0],items[1]):
            sample = np.asarray(sample)
            lpdf = likelihood.get_log_probability_f(labels, sample).sum()
            Z[idx] = lpdf
            idx += 1
    Z = np.asarray(Z).reshape(X.shape)  # match the mesh shape (works for non-square grids too)
    return Z
In [4]:
def approx_posterior_plot(methods, kernel_func, features, mean_func, 
                          labels, likelihoods, kernel_log_scale, 
                          xmin, xmax, ymin, ymax, delta, plots):
    """
    This function is used to generate points drawn from approximated posterior and plot them
        
    Parameters:
        methods - a list of methods used to approximate the posterior
        kernel_func - a covariance function for GP
        features - X
        mean_func -  mean function for GP
        labels - Y
        likelihoods - likelihood classes used to model labels
        kernel_log_scale - a hyper-parameter of covariance function
        xmin, ymin, xmax, ymax, delta - used for generating points from an approximated posterior
        plots - subplot
    Returns:
        Nothing
    """
    
    (rows, cols) = plots.shape
    methods = np.asarray(methods).reshape(rows, cols)
    likelihoods = np.asarray(likelihoods).reshape(rows, cols)
    for r in xrange(rows):
        for c in xrange(cols):
            inference = methods[r][c]
            likelihood = likelihoods[r][c]
            inf = inference(kernel_func, features, mean_func, labels, likelihood())
            inf.set_scale(exp(kernel_log_scale))
            #get the approximated Gaussian distribution
            mu = inf.get_posterior_mean()
            Sigma = inf.get_posterior_covariance()
            #normalized approximated posterior
            (X,Y,Z) = Gaussian_points(Sigma, mu, xmin, xmax, ymin, ymax, delta)
            CS = plots[r][c].contour(X, Y, np.exp(Z))
            plots[r][c].set_title('posterior via %s'%inf.get_name())
            plots[r][c].axis('equal')
            plots[r][c].clabel(CS,inline=1,fontsize=10)
In [5]:
#a toy 2D example (data)
x=np.asarray([sqrt(2),-sqrt(2)]).reshape(1,2)
y=np.asarray([1,-1])

features = RealFeatures(x)
labels = BinaryLabels(y)
kernel_log_sigma = 1.0
kernel_log_scale = 1.5

#a mean function and a covariance function for GP
mean_func = ConstMean()
#using log_sigma as a hyper-parameter of GP instead of sigma
kernel_sigma = 2*exp(2*kernel_log_sigma)
kernel_func = GaussianKernel(10, kernel_sigma)
kernel_func.init(features, features)
#a prior distribution derived from GP via applying the mean function and the covariance function to data
Sigma = kernel_func.get_kernel_matrix()
Sigma = Sigma * exp(2.0*kernel_log_scale)
mu = mean_func.get_mean_vector(features)

delta = 0.1
xmin = -4
xmax = 6
ymin = -6
ymax = 4

#a prior (Gaussian) derived from GP
(X,Y,Z1) = Gaussian_points(Sigma, mu, xmin, xmax, ymin, ymax, delta)

col_size = 6
f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(col_size*3,col_size))

CS1=ax1.contour(X, Y, np.exp(Z1))
ax1.set_title('Prior')
ax1.axis('equal')

ax1.clabel(CS1,
           inline=1,
           fontsize=10)

#likelihood (inverse logit, A.K.A. Bernoulli-logistic)
#likelihoods classes for inference methods  (see the likelihood table)
likelihoods=[
LogitLikelihood,
LogitVGLikelihood,
#LogitVGLikelihood,
LogitVGLikelihood,
LogitDVGLikelihood
]
#likelihood
Z2 = likelihood_points(X,Y,labels,LogitLikelihood())
CS2 = ax2.contour(X, Y, np.exp(Z2))
ax2.set_title('Likelihood')
ax2.axis('equal')
ax2.clabel(CS2,
           inline=1,
           fontsize=10)

#an unnormalized true posterior (non-Gaussian)
Z3 = Z1+Z2
CS3 = ax3.contour(X, Y, np.exp(Z3))
ax3.set_title('True posterior')
ax3.axis('equal')
ax3.clabel(CS3,
           inline=1,
           fontsize=10)

f, plots =plt.subplots(2, 2, figsize=(col_size*3,col_size*2))

#Inference methods used to approximate a posterior (see the table below)
methods=[
SingleLaplaceInferenceMethod,
KLDiagonalInferenceMethod,
#KLCovarianceInferenceMethod,
KLCholeskyInferenceMethod,
KLDualInferenceMethod
]
#posterior
approx_posterior_plot(methods, kernel_func, features, mean_func, labels, likelihoods, kernel_log_scale, 
                      xmin, xmax, ymin, ymax, delta, plots)

plt.show()

Remark: The true posterior is a non-Gaussian distribution and its normalization constant is usually difficult to compute. All approximate posteriors are Gaussian distributions. A Gaussian posterior allows efficient inference and prediction. The Laplace approximation is fast but not ideal in terms of approximating the posterior and making predictions. At the price of speed, variational methods such as KLCovariance and KLCholesky offer more accurate approximations than the Laplace method. The KLDual method can also provide a good approximation and in many cases is better than the Laplace method. On the other hand, the KLApproxDiagonal (mean-field) method is as fast as the Laplace method, and for certain datasets it is more accurate than the Laplace method in terms of making predictions.
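To get a rough feeling for the speed trade-off mentioned above, one could time the inference methods on the toy example. The sketch below (not part of the original notebook) simply reuses the methods and likelihoods lists and the toy data defined earlier, so the numbers are only indicative:

# rough timing comparison of the approximate inference methods on the 2-D toy example
for inference, likelihood in zip(methods, likelihoods):
    inf = inference(kernel_func, features, mean_func, labels, likelihood())
    inf.set_scale(exp(kernel_log_scale))
    start = time.time()
    inf.get_posterior_mean()
    inf.get_posterior_covariance()
    print("%s took %.4f seconds" % (inf.get_name(), time.time() - start))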

We use the banana data set to demonstrate the decision boundary of GPC. The data set can be found here. This is a binary classification problem, and the goal of this example is to visually show the decision boundary of GPC models. The following is a description of the data set.

It is an artificial data set where instances belong to several clusters with a banana shape. There are two attributes, At1 and At2, corresponding to the x and y axes, respectively. The class label (-1 or 1) represents one of the two banana shapes in the data set.

We now define code to train a GP classifier using Shogun.

In [6]:
def train_small_scale(inference, linesearch, likelihood, x_train, y_train, kernel_log_sigma, kernel_log_scale):
    
    """
    This function is used to train a GP classifer given an inference method
        
    Parameters:
        inference - an inference methods used to train the classifer
        linesearch - a linesearch method used in the inference method 
        likelihood - likelihood function to model data
        x_train, x_test - X for training and testing
        y_train - labels for training
        kernel_log_sigma, kernel_log_scale hyper-parameters of covariance function    
    Returns:
        trained GPC model, name of inference method
    """

    mean_func = ZeroMean()
    kernel_sigma = 2*exp(2*kernel_log_sigma);
    kernel_func = GaussianKernel(10, kernel_sigma)

    #Y is a sample-by-1 vector
    labels_train = BinaryLabels(y_train)
    #X is a feature-by-sample matrix
    features_train=RealFeatures(x_train)

    inf = inference(kernel_func, features_train, mean_func, labels_train, likelihood)
    inf.set_scale(exp(kernel_log_scale))

    gp = GaussianProcessClassification(inf)
    gp.train()

    return gp, inf.get_name()
In [7]:
def extract_banana_dataset(input_path):
    """
    This function is used to process raw data of the banana data set
        
    Parameters:
        inf - the path of the raw data
    Returns:
        x_train, x_test - features for training and testing
        y_train, y_test - labels for training and testing
        x1s, x2s  -for ploting training or testing points
    """

    random.seed(1)
    x=[[], []]
    y=[[], []]
    x1s=[{}, {}]
    x2s=[{}, {}]

    for line in open(input_path):
        line=line.strip()
        if line.startswith("@"):
            continue
        choice=random.randint(0,1)
        if choice==0:
            #testing set
            idx=1
        else:
            #training set
            idx=0
        info=map(float,line.split(','))
        x[idx].append(info[:-1])
        y[idx].append(info[-1])
        if info[-1]>0:
            #positive cases
            container_x1=x1s[idx].setdefault(1,[])
            container_x2=x2s[idx].setdefault(1,[])
        else:
            #negative cases
            container_x1=x1s[idx].setdefault(0,[])
            container_x2=x2s[idx].setdefault(0,[])
        container_x1.append(info[0])
        container_x2.append(info[1])

    x_train = np.asarray(x[0]).T
    y_train = np.asarray(y[0])
    x_test = np.asarray(x[1]).T
    y_test = np.asarray(y[1])

    return x_train, y_train, x_test, y_test, x1s, x2s
In [8]:
from itertools import product
def plot_small_scale(input_path):
    """
    This function is used to print the decision bounday and testing points
        
    Parameters:
        inf - the path of the raw data
    """
    #obtain data points with labels from raw data
    (x_train, y_train, x_test, y_test, x1s, x2s)=extract_banana_dataset(input_path)
    
    #binary classification problem (positive labels and negative labels)
    inference =SingleLaplaceInferenceMethod
    likelihood = LogitLikelihood()
    linesearch = 3
    
    #we show how parameters of GPC affect the decision boundary
    #over-fit, right-fit, under-fit
    kernel_log_sigmas=[-5,0,1]
    kernel_log_scale=0 #we fix the scale parameter
    
    #plot setting
    col_size=8
    f, plots =plt.subplots(len(kernel_log_sigmas),2,figsize=(col_size*2,col_size*len(kernel_log_sigmas)))
    
    #points used to generate decision boundary
    n_boundary=50
    x1_boundary = np.linspace(x_train[0,:].min()-1, x_train[0,:].max()+1, n_boundary)
    x2_boundary = np.linspace(x_train[1,:].min()-1, x_train[1,:].max()+1, n_boundary)
    x_boundary = np.asarray(list(product(x1_boundary, x2_boundary))).T

    features_boundary=RealFeatures(x_boundary)
    
    for idx,kernel_log_sigma in enumerate(kernel_log_sigmas):
        #train a GPC model given training data points
        (gpc, name)=train_small_scale(inference, linesearch, likelihood, x_train, y_train, kernel_log_sigma, kernel_log_scale)
        #obtain the probabilities of being positive label(s) given new data point(s)
        prbs=gpc.get_probabilities(features_boundary)

        for choice in [0,1]:
             #decision boundary
             plots[idx][choice].contour(x1_boundary, x2_boundary, np.reshape(prbs, (n_boundary, n_boundary)).T, levels=[0.5], colors=('blue'))  
             #training points or testing points with true positive tag
             plots[idx][choice].scatter(x1s[choice][1],x2s[choice][1], c='red', alpha=0.5)
             #training points or testing points with true negative tag
             plots[idx][choice].scatter(x1s[choice][0],x2s[choice][0], c='yellow', alpha=0.5)
             plots[idx][choice].set_xlabel("At1")
             plots[idx][choice].set_ylabel("At2")
             plots[idx][choice].axis("equal")
             type_name="training"
             if choice==1:
                type_name="testing"
             if idx==0:
                fit_condition="over-fit"
             elif idx==1:
                fit_condition="right-fit"
             else:
                fit_condition="under-fit"                           
             plots[idx][choice].set_title("Decision boundary (blue) \n of %s on %s points (%s)"%(name, type_name,fit_condition))

    plt.show()
In [9]:
input_path=os.path.join(SHOGUN_DATA_DIR, 'toy/banana.dat')
plot_small_scale(input_path)
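As a follow-up (not part of the original notebook), one could also evaluate the trained classifier on the held-out half of the banana data. The sketch below reuses the helpers defined above and assumes the usual Shogun calls apply_binary and get_labels; the chosen hyper-parameters correspond to the right-fit setting from the plot above:

#evaluate test accuracy of the right-fit model on the banana data
(x_train, y_train, x_test, y_test, x1s, x2s) = extract_banana_dataset(input_path)
(gpc, name) = train_small_scale(SingleLaplaceInferenceMethod, 3, LogitLikelihood(),
                                x_train, y_train, 0, 0)
features_test = RealFeatures(x_test)
predictions = gpc.apply_binary(features_test)            # predicted BinaryLabels
accuracy = np.mean(predictions.get_labels() == y_test)   # fraction of correctly predicted labels
print("test accuracy of %s: %.3f" % (name, accuracy))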