Restricted Boltzmann Machines & Deep Belief Networks

by Khaled Nasr as a part of a GSoC 2014 project mentored by Theofanis Karaletsos and Sergey Lisitsyn

In this notebook we'll take a look at training and evaluating restricted Boltzmann machines and deep belief networks in Shogun.


Restricted Boltzmann Machines

An RBM is an energy based probabilistic model. It consists of two groups of variables: the visible variables $ v $ and the hidden variables $ h $. The key assumption that RBMs make is that the hidden units are conditionally independent given the visible units, and vice versa.

The RBM defines its distribution through an energy function $E(v,h)$, which is a function that assigns a number (called energy) to each possible state of the visible and hidden variables. The probability distribution is defined as:

$$ P(v,h) := \frac{\exp(-E(v,h))}{Z} , \qquad Z = \sum_{v,h} \exp(-E(v,h))$$

where $Z$ is a constant that makes sure that the distribution sums/integrates to $1$. This distribution is also called a Gibbs distribution and $Z$ is sometimes called the partition function.

From the definition of $P(v,h)$ we can see that the probability of a configuration increases as its energy decreases. Training an RBM in an unsupervised manner involves manipulating the energy function so that it would assign low energy (and therefore high probability) to values of $v$ that are similar to the training data, and high energy to values that are far from the training data.

For an RBM with binary visible and hidden variables the energy function is defined as:

$ E(v,h) = -\sum_i \sum_j h_i W_{ij} v_j - \sum_i h_i c_i - \sum_j v_j b_j $

where $b \in \mathbb{R^n} $ is the bias for the visible units, $c \in \mathbb{R^m}$ is the bias for hidden units and $ W \in \mathbb{R^{mxn}}$ is the weight matrix between the hidden units and the visible units.

Plugging that definition into the definition of the probability distribution will yield the following conditional distributions for each of the hidden and visible variables:

$$ P(h=1|v) = \frac{1}{1+exp(-Wv-c)}, \quad P(v=1|h) = \frac{1}{1+exp(-W^T h-b)} $$

We can do a quick visualization of an RBM:

In [1]:
%pylab inline
%matplotlib inline
import os
SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')
import networkx as nx
G = nx.Graph()
pos = {}

for i in range(8):
    pos['V'+str(i)] = (i,0)
    pos['H'+str(i)] = (i,1)
    for j in range(8): G.add_edge('V'+str(j),'H'+str(i))

nx.draw(G, pos, node_color='y', node_size=750)
Populating the interactive namespace from numpy and matplotlib

The nodes labeled V are the visible units, the ones labeled H are the hidden units. There's an indirected connection between each hidden unit and all the visible units and similarly for visible unit. There are no connections among the visible units or among the hidden units, which implies the the hidden units are independent of each other given the visible units, and vice versa.

Deep Belief Networks

If an RBM is properly trained, the hidden units learn to extract useful features from training data. An obvious way to go further would be transform the training data using the trained RBM, and train yet another RBM on the transformed data. The second RBM will learn to extract useful features from the features that the first RBM extracts. The process can be repeated to add a third RBM, and so.

When stacked on top of each other, those RBMs form a Deep Belief Network [1]. The network has directed connections going from the units in each layer to units in the layer below it. The connections between the top layer and the layer below it are undirected. The process of stacking RBMs to form a DBN is called pre-training the DBN.

After pre-training, the DBN can be used to initialize a similarly structured neural network which can be used for supervised classification.

We can do a visualization of a 4-layer DBN:

In [2]:
G = nx.DiGraph()
pos = {}

for i in range(8):
    pos['V'+str(i)] = (i,0)
    pos['H'+str(i)] = (i,1)
    pos['P'+str(i)] = (i,2)
    pos['Q'+str(i)] = (i,3)
    for j in range(8): 

nx.draw(G, pos, node_color='y', node_size=750)

RBMs in Shogun

RBMs in Shogun are handled through the CRBM class. We create one by specifying the number of visible units and their type (binary, Gaussian, and Softmax visible units are supported), and the number of hidden units (only binary hidden units are supported).

In this notebook we'll train a few RBMs on the USPS dataset for handwritten digits. We'll have one RBM for each digit class, making 10 RBMs in total:

In [3]:
from modshogun import RBM, RBMVUT_BINARY, Math

# initialize the random number generator with a fixed seed, for repeatability

rbms = []
for i in range(10):
    rbms.append(RBM(25, 256, RBMVUT_BINARY)) # 25 hidden units, 256 visible units (one for each pixel in a 16x16 binary image)

Next we'll load the USPS dataset:

In [4]:
from import loadmat

dataset = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass/usps.mat'))

Xall = dataset['data']
# the usps dataset has the digits labeled from 1 to 10 
# we'll subtract 1 to make them in the 0-9 range instead
Yall = np.array(dataset['label'].squeeze(), dtype=np.double)-1 

Now we'll move on to training the RBMs using Persistent Contrastive Divergence [2]. Training using regular Contrastive Divergence [3] is also supported.The optimization is performed using Gradient Descent. The training progress can be monitored using the reconstruction error or the psuedo-likelihood. Check the public attributes of the CRBM class for all the available training options.

In [5]:
from modshogun import RealFeatures, RBMMM_PSEUDO_LIKELIHOOD

# uncomment this line to allow the training progress to be printed on the console
#from modshogun import MSG_INFO; rbms[0].io.set_loglevel(MSG_INFO)

for i in range(10):
    # obtain the data for digit i
    X_i = Xall[:,Yall==i]
    # binarize the data for use with the RBM
    X_i = (X_i>0).astype(float64)
    # set the number of contrastive divergence steps
    rbms[i].cd_num_steps = 5
    # set the gradient descent parameters
    rbms[i].gd_learning_rate = 0.005
    rbms[i].gd_mini_batch_size = 100
    rbms[i].max_num_epochs = 30
    # set the monitoring method to pseudo-likelihood
    rbms[i].monitoring_method = RBMMM_PSEUDO_LIKELIHOOD
    # start training

After training, we can draw samples from the RBMs to see what they've learned. Samples are drawn using Gibbs sampling. We'll draw 10 samples from each RBM and plot them:

In [6]:
samples = zeros((256,100))
for i in range(10):
    # initialize the sampling chain with a random state for the visible units
    # run 10 chains for a 1000 steps to obtain the samples
    samples[:,i*10:i*10+10] = rbms[i].sample_group(0, 1000, 10)

# plot the samples
for i in range(100):
	ax.imshow(samples[:,i].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)

DBNs in Shogun

Now we'll create a DBN, pre-train it on the digits dataset, and use it initialize a neural network which we can use for classification.

DBNs are handled in Shogun through the CDeepBeliefNetwork class. We create a network by specifying the number of visible units it has, and then add the desired number of hidden layers using add_hidden_layer(). When done, we call initialize_neural_network() to initialize the network:

In [7]:
from modshogun import DeepBeliefNetwork

dbn = DeepBeliefNetwork(256) # 256 visible units
dbn.add_hidden_layer(200) # 200 units in the first hidden layer
dbn.add_hidden_layer(300) # 300 units in the second hidden layer


Then we'll pre-train the DBN on the USPS dataset. Since we have 3 layers, the DBN will be pre-trained as two RBMs: one that consists of the first hidden layer and the visible layer, the other consists of the first hidden layer and the second hidden layer. Pre-training parameters can be specified using the pt_* public attributes of the class. Each of those attributes is an SGVector whose length is the number of RBMs (2 in our case). It can be used to set the parameters for each RBM indiviually. SGVector's set_const() method can also be used to assign the same parameter value for all RBMs.

In [8]:
# take 3000 examples for training, the rest for testing
Xtrain = Xall[:,0:3000]
Ytrain = Yall[0:3000]
Xtest = Xall[:,3000:-1]
Ytest = Yall[3000:-1]

# set the number of contrastive divergence steps

# set the gradient descent parameters

# binarize the data and start pre-training

After pre-training, we can visualize the features learned by the first hidden layer by plotting the weights between some hidden units and the visible units:

In [9]:
# obtain the weights of the first hidden layer
w1 = dbn.get_weights(0)

# plot the weights between the first 100 units in the hidden layer and the visible units
for i in range(100):
	ax1.imshow(w1[i,:].reshape((16,16)), interpolation='nearest', cmap = cm.Greys_r)

Now, we'll use the DBN to initialize a CNeuralNetwork. This is done through the convert_to_neural_network() method. The neural network will consist of a CNeuralInputLayer with 256 neurons, a CNeuralLogisticLayer with 200 neurons, and another CNeuralLogisticLayer with 300 neurons. We'll also add a CNeuralSoftmaxLayer as an output layer so that we can train the network in a supervised manner. We'll also train the network on the training set:

In [10]:
from modshogun import NeuralSoftmaxLayer, MulticlassLabels

# get the neural network
nn = dbn.convert_to_neural_network(NeuralSoftmaxLayer(10))

# add some L2 regularization
nn.l2_coefficient = 0.0001

# start training
_ = nn.train(RealFeatures(Xtrain))

And finally we'll measure the classification accuracy on the test set:

In [11]:
from modshogun import MulticlassAccuracy

predictions = nn.apply_multiclass(RealFeatures(Xtest))
accuracy = MulticlassAccuracy().evaluate(predictions, MulticlassLabels(Ytest)) * 100

print "Classification accuracy on the test set =", accuracy, "%"
Classification accuracy on the test set = 92.4567254248 %