The text of the DEMO program: RBM in supervised learning

Elementary Restricted Boltzmann Machine and Contrastive Divergence algorithm

When reading articles about restricted Boltzmann machines I found that their importance, efficiency and complexity are highly exaggerated by the authors. Maybe it is only my impression, but I saw an elementary technique that was explained very badly. This article and the accompanying Java code are my attempt to explain it in a reader-friendly way.

Understanding the task

Let us consider all possible bytes with 4 positive bits. Assume we need to sort them into two conventional groups: centered bits, such as 00111100, and spread-apart bits, such as 11000011. These two bytes are the ideal cases of centered and spread bits; the others may not be ideal. We may see the byte 01011100 as rather centered than spread, and the byte 11000101 as rather spread than centered. So we want a utility that examines each byte and, if the byte has 4 positive bits, classifies it as centered or spread by assigning a measure of belonging to each class, for example 80% centered and 20% spread.
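
For reference, there are C(8,4) = 70 bytes with exactly 4 positive bits. A minimal Java sketch (separate from the demo program itself; the class name is arbitrary) that enumerates them could look like this:

    public class FourBitBytes {
        public static void main(String[] args) {
            int count = 0;
            for (int b = 0; b < 256; b++) {
                if (Integer.bitCount(b) == 4) {    // keep only bytes with exactly 4 positive bits
                    // print the byte as an 8-character binary string with leading zeros
                    System.out.println(String.format("%8s", Integer.toBinaryString(b)).replace(' ', '0'));
                    count++;
                }
            }
            System.out.println("Total: " + count); // 70 = C(8,4)
        }
    }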

Elementary logical solution

Let us consider the weight matrix W

Matrix W
 1  1 -1 -1 -1 -1  1  1
-1 -1  1  1  1  1 -1 -1


Let us call our ideal centered and spread bytes the training sample and multiply it by the weight matrix:
Matrix W
 1  1 -1 -1 -1 -1  1  1
-1 -1  1  1  1  1 -1 -1

Training sample T
1 0
1 0
0 1
0 1
0 1
0 1
1 0
1 0

Result W * T
 4 -4
-4  4

The result can be interpreted as a numerical characteristic of how strongly the tested binary vector belongs to a particular class: a positive number means "YES" and a negative number means "NO". To bring this result closer to our intuitive understanding, the function SIGMOID(x) = 1/(1 + exp(-x)) is used. It maps every scalar into the interval (0,1). When we say we apply SIGMOID to a vector or matrix, it means we apply it to every element. After applying SIGMOID to the result we get the matrix

Result after SIGMOID
0.98 0.02
0.02 0.98

If we round the result matrix it turns into the identity matrix. If we left-multiply the weight matrix by this identity matrix and apply SIGMOID to the result, we get back an approximation of the training sample. So the following holds, with the stipulation that the equalities are approximate

SIGMOID( W * T) = I
SIGMOID( I * W) = T

where W is the weight matrix, T is the training sample, and I is the indicator of which class each sample belongs to (strictly speaking, the second product returns the transpose of T, since I * W has the dimensions of W; I ignore this distinction here).
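
To make this arithmetic concrete, here is a minimal Java sketch (my own illustration; class and variable names are arbitrary) that computes SIGMOID(W * T) for the matrices above and prints a result close to the identity matrix:

    public class ForwardPass {
        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        public static void main(String[] args) {
            double[][] W = {                         // 2 x 8 weight matrix from above
                { 1,  1, -1, -1, -1, -1,  1,  1},
                {-1, -1,  1,  1,  1,  1, -1, -1}
            };
            double[][] T = {                         // 8 x 2 training sample from above
                {1,0},{1,0},{0,1},{0,1},{0,1},{0,1},{1,0},{1,0}
            };
            for (int i = 0; i < 2; i++) {            // rows of W
                for (int j = 0; j < 2; j++) {        // columns of T
                    double sum = 0;
                    for (int k = 0; k < 8; k++) sum += W[i][k] * T[k][j];
                    System.out.printf("%.2f ", sigmoid(sum));  // 0.98 on the diagonal, 0.02 elsewhere
                }
                System.out.println();
            }
        }
    }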

In this example we simply wrote the weight matrix down by hand, but it is supposed to be built from collected data. We show how to do that in the next section.

Contrastive Divergence

We can see that a properly built weight matrix solves the problem of classifying the training sample and can therefore be used for quantitative evaluation of other data. The contrastive divergence algorithm is used to build this weight matrix. Let us first examine our data. We have two matrices: the training sample and the class identity matrix
Training sample
1 0
1 0
0 1
0 1
0 1
0 1
1 0
1 0

Class identity
1 0
0 1

The standard terminology used for RBM is to call the elements of the training vectors visible units, because we know them, and the elements of the class identity matrix hidden units, because they are unknown upfront. Anyone familiar with linear algebra can see the similarity to finding a left pseudoinverse T+, satisfying T+ T = I, which has been known for a century. I did not compare which is more effective, contrastive divergence or the pseudoinverse; I only point out the strong similarity.
In our case we have 8 visible units, which may take values 0 or 1, and two hidden classes. We need to find a weight matrix that computes a numerical measure of how strongly the provided combination of values in the visible units belongs to each class. We know that the sample 11000011 belongs to class 1 and the sample 00111100 belongs to class 2.
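
To illustrate the similarity, here is a small Java sketch (my own side note, not part of the demo program) that computes the left pseudoinverse T+ = (T^T * T)^-1 * T^T for the training sample above and verifies that T+ * T is the identity matrix:

    public class PseudoinverseCheck {
        public static void main(String[] args) {
            double[][] T = {                       // 8 x 2 training sample
                {1,0},{1,0},{0,1},{0,1},{0,1},{0,1},{1,0},{1,0}
            };
            double[][] G = new double[2][2];       // G = T^T * T (2 x 2 Gram matrix)
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                    for (int k = 0; k < 8; k++) G[i][j] += T[k][i] * T[k][j];
            // invert the 2 x 2 Gram matrix
            double det = G[0][0]*G[1][1] - G[0][1]*G[1][0];
            double[][] Ginv = { { G[1][1]/det, -G[0][1]/det }, { -G[1][0]/det, G[0][0]/det } };
            double[][] Tplus = new double[2][8];   // Tplus = Ginv * T^T (2 x 8 left pseudoinverse)
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 8; j++)
                    for (int k = 0; k < 2; k++) Tplus[i][j] += Ginv[i][k] * T[j][k];
            // check: Tplus * T should print the 2 x 2 identity matrix
            for (int i = 0; i < 2; i++) {
                for (int j = 0; j < 2; j++) {
                    double s = 0;
                    for (int k = 0; k < 8; k++) s += Tplus[i][k] * T[k][j];
                    System.out.printf("%.2f ", s);
                }
                System.out.println();
            }
        }
    }

For this particular training sample the pseudoinverse comes out as 0.25 * T^T, whose rows point at the same bit positions as the rows of the hand-made weight matrix W. The sketch only illustrates the analogy; the pseudoinverse is not part of contrastive divergence.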

The contrastive divergence algorithm is elementary (a Java sketch of these steps follows the list):
  • 1. Assign random values to the weight matrix W.
  • 2. Take any training sample vector t.
  • 3. Compute the first hidden vector as h1 = SIGMOID(W * t).
  • 4. Make the elements of h1 either 0 or 1; I make only one element equal to 1 and all others 0.
  • 5. Having the first hidden vector h1, compute the visible vector as v = SIGMOID(h1 * W).
  • 6. Make the elements of v either 0 or 1; in this case multiple elements may equal 1.
  • 7. Having v, compute another hidden vector as h2 = SIGMOID(W * v).
  • 8. Now, having t, h1, v and h2, update the matrix W as W = W + learningRate * (h1 * t^T - h2 * v^T), where learningRate is a parameter from the interval (0,1), and h1 * t^T and h2 * v^T are rank-one matrices with the same dimensions as W.
  • 9. Repeat everything starting from (2) for all training vectors, multiple times, until W becomes stable.
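
Below is a compact Java sketch of these nine steps for the 8-bit example above. It is my own condensed illustration, not the text of the demo program: class and method names are arbitrary, in step 4 I set the largest activation to 1, and in step 6 I threshold at 0.5.

    import java.util.Random;

    public class ContrastiveDivergenceDemo {
        static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

        public static void main(String[] args) {
            // two training vectors: spread bits 11000011 and centered bits 00111100
            double[][] samples = { {1,1,0,0,0,0,1,1}, {0,0,1,1,1,1,0,0} };
            double[][] W = new double[2][8];                 // 2 hidden classes x 8 visible units
            Random rnd = new Random();
            for (int i = 0; i < 2; i++)                      // step 1: random initial weights
                for (int j = 0; j < 8; j++) W[i][j] = rnd.nextDouble() - 0.5;

            double learningRate = 0.1;                       // parameter from the (0,1) interval
            for (int epoch = 0; epoch < 1000; epoch++) {
                for (double[] t : samples) {                 // step 2: take a training vector t
                    double[] h1 = hidden(W, t);              // step 3: h1 = SIGMOID(W * t)
                    oneHot(h1);                              // step 4: largest element -> 1, others -> 0
                    double[] v = visible(W, h1);             // step 5: v = SIGMOID(h1 * W)
                    for (int j = 0; j < 8; j++) v[j] = v[j] >= 0.5 ? 1 : 0;   // step 6
                    double[] h2 = hidden(W, v);              // step 7: h2 = SIGMOID(W * v)
                    for (int i = 0; i < 2; i++)              // step 8: W += rate * (h1*t^T - h2*v^T)
                        for (int j = 0; j < 8; j++)
                            W[i][j] += learningRate * (h1[i] * t[j] - h2[i] * v[j]);
                }                                            // step 9: repeat for all vectors
            }
            for (double[] t : samples) {                     // class measures for the two training bytes
                double[] h = hidden(W, t);
                System.out.printf("%.2f  %.2f%n", h[0], h[1]);
            }
        }

        static double[] hidden(double[][] W, double[] v) {   // SIGMOID(W * v), length 2
            double[] h = new double[2];
            for (int i = 0; i < 2; i++) {
                double s = 0;
                for (int j = 0; j < 8; j++) s += W[i][j] * v[j];
                h[i] = sigmoid(s);
            }
            return h;
        }

        static double[] visible(double[][] W, double[] h) {  // SIGMOID(h * W), length 8
            double[] v = new double[8];
            for (int j = 0; j < 8; j++) {
                double s = 0;
                for (int i = 0; i < 2; i++) s += h[i] * W[i][j];
                v[j] = sigmoid(s);
            }
            return v;
        }

        static void oneHot(double[] h) {                     // keep a single 1 at the largest element
            int best = 0;
            for (int i = 1; i < h.length; i++) if (h[i] > h[best]) best = i;
            for (int i = 0; i < h.length; i++) h[i] = (i == best) ? 1 : 0;
        }
    }

After training, SIGMOID(W * t) gives the two class measures for each training byte; in a successful run one measure is close to 1 and the other close to 0. Which hidden unit ends up representing which class depends on the random start, and an unlucky start can occasionally place both samples into the same class, in which case the run should simply be repeated.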

I have covered only a small but, for understanding, critical part of RBM; I recommend reading more. I deviated from the standard terminology only to make RBM easier to understand; of course, I recommend using the standard terminology. The other interesting part of RBM (not addressed here) is unsupervised learning, when a large set of data is available but there is no training sample. I hope that after this elementary introduction the other materials about RBM will become clear to a reader new to the subject.


Andrew Polar, December 2017.