The text of the DEMO program: RBM in supervised learning

Elementary Restricted Boltzmann Machine and Contrastive Divergence algorithm

When reading articles about restricted Boltzmann machines, I found that their importance, efficiency and complexity are highly exaggerated by the authors. Maybe it is only my impression, but I saw a rather elementary technique explained very poorly. This article and the accompanying Java code are my attempt to explain it in a reader-friendly way.

Understanding the task

Let us consider all possible bytes with 4 positive bits. Assume we need to sort them into two conventional groups: bytes with centered bits, like 00111100, and bytes with spread-apart bits, like 11000011. These two bytes are the ideal cases of centered and spread bits; the others may not be ideal. We may see the byte 01011100 as rather centered than spread, and the byte 11000101 as rather spread than centered. So we want a utility that examines each byte and, if the byte has 4 positive bits, classifies it as centered or spread by assigning measures of belonging to each class, for example, 80% centered and 20% spread.

Elementary logical solution

Let us consider the weight matrix W

Matrix W
 1  1 -1 -1 -1 -1  1  1
-1 -1  1  1  1  1 -1 -1


Let us call our two ideal bytes the training sample T and multiply it by the weight matrix:

Matrix W
 1  1 -1 -1 -1 -1  1  1
-1 -1  1  1  1  1 -1 -1

Training sample T
1 0
1 0
0 1
0 1
0 1
0 1
1 0
1 0

Result W * T
 4 -4
-4  4
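For example, the top-left entry 4 is the dot product of the first row of W with the first training column 11000011: 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 = 4; the entry below it is the dot product of the second row with the same column, which gives -4.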

The result can be interpreted as a numerical characteristic of how strongly the tested binary vector belongs to a particular class: a positive number means "YES" and a negative number means "NO". To bring this result closer to our intuitive understanding, the function SIGMOID(x) = 1/(1 + exp(-x)) is introduced. It maps every real number into the interval (0, 1). When we say we apply SIGMOID to a vector or matrix, that means we apply it to every element. After applying SIGMOID to the result we get the matrix

Result
0.98 0.02
0.02 0.98

If we round the result matrix, it turns into the identity matrix. If we left-multiply the weight matrix by the identity matrix and apply SIGMOID to the result, we get an approximation of the transposed training sample. So the following is true, with the stipulation that the equalities are approximate:

SIGMOID(W * T) = I
SIGMOID(I * W) = T^T

where W is the weight matrix, T is the training sample, I is the indicator of the class each sample belongs to, and the superscript T denotes transposition.
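The arithmetic above can be verified with a few lines of code. The following is a minimal Java sketch (the class and variable names are mine, not taken from the accompanying DEMO program) that multiplies each ideal byte by W and applies SIGMOID; it prints values close to 0.98 and 0.02, matching the result matrix above.

public class LogicalSolutionDemo {

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    public static void main(String[] args) {
        double[][] W = {
            { 1,  1, -1, -1, -1, -1,  1,  1},   // row that answers "spread?"
            {-1, -1,  1,  1,  1,  1, -1, -1}    // row that answers "centered?"
        };
        double[][] samples = {
            {1, 1, 0, 0, 0, 0, 1, 1},   // spread byte   11000011
            {0, 0, 1, 1, 1, 1, 0, 0}    // centered byte 00111100
        };
        for (double[] t : samples) {
            for (double[] row : W) {
                double sum = 0.0;                         // dot product of a row of W with t
                for (int i = 0; i < t.length; i++) sum += row[i] * t[i];
                System.out.printf("%.2f ", sigmoid(sum)); // prints 0.98 or 0.02
            }
            System.out.println();
        }
    }
}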

In this example the elements of the weight matrix were assigned by logic, but they are supposed to be built from collected data. We show how to do that in the next section.

Contrastive Divergence

We can see that a properly built weight matrix solved the problem of classifying the training sample and therefore can be used for quantitative evaluation of other data. There are several algorithms that build the weight matrix through multiple formal steps. One of them is Contrastive Divergence.

Let us first examine our data. We have two matrices: the training sample and the class identity matrix.

Training sample
1 0
1 0
0 1
0 1
0 1
0 1
1 0
1 0

Class identity
1 0
0 1

The standard terminology in RBM literature is to call the elements of training vectors visible units, because we know them, and the elements of the class identity matrix hidden units, because they are unknown up front. Anyone familiar with linear algebra can see the similarity with finding a left pseudoinverse T^+ such that T^+ * T = I, which has been known for about a century. I did not compare which is more effective, contrastive divergence or the pseudoinverse; I only point to the strong similarity.
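For reference, when the columns of T are linearly independent, this left pseudoinverse has the explicit form T^+ = (T^T * T)^(-1) * T^T, which gives T^+ * T = I exactly.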
In our case we have 8 visible units, which may take values 0 or 1, and two hidden classes. We need to find a weight matrix that computes a numerical measure of how strongly the provided combination of values in the visible units belongs to each class. We know that the sample 11000011 belongs to class 1 and the sample 00111100 belongs to class 2.

The contrastive divergence algorithm is elementary (a minimal Java sketch follows the list):
  • 1. Assign random values to the weight matrix W.
  • 2. Select any training sample vector t (in our case it is a column of the training matrix).
  • 3. Compute the first hidden vector h1 as h1 = SIGMOID(W * t).
  • 4. Make the elements of h1 either 0 or 1; I make only one value equal to 1 and all others 0.
  • 5. Using this modified hidden vector h1, compute the visible vector v as v^T = SIGMOID(h1^T * W).
  • 6. Make the elements of v either 0 or 1; in this case it is possible to have multiple 1s.
  • 7. Using v, compute another hidden vector h2 as h2 = SIGMOID(W * v).
  • 8. Now, having t, h1, v and h2, update the matrix W as W = W + learningRate * (h1 * t^T - h2 * v^T), where learningRate is a parameter from the (0, 1) interval and h1 * t^T and h2 * v^T are rank-one matrices of the same shape as W.
  • 9. Repeat everything starting from step (2) for all training vectors, multiple times, until W becomes stable.
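Below is a minimal Java sketch of these nine steps applied to our two training columns. It is only an illustration under my own assumptions: the class name, the random seed, the learning rate of 0.1, the fixed number of passes, and the reading of step 4 as "set the largest sigmoid output to 1" are mine and are not taken from the accompanying DEMO program.

import java.util.Random;

public class ContrastiveDivergenceDemo {

    static final Random RND = new Random(1);

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Steps 3 and 7: h = SIGMOID(W * x), where W is 2 x 8 and x has 8 elements.
    static double[] hidden(double[][] W, double[] x) {
        double[] h = new double[W.length];
        for (int j = 0; j < W.length; j++) {
            double s = 0.0;
            for (int i = 0; i < x.length; i++) s += W[j][i] * x[i];
            h[j] = sigmoid(s);
        }
        return h;
    }

    // Step 5: v^T = SIGMOID(h^T * W).
    static double[] visible(double[][] W, double[] h) {
        double[] v = new double[W[0].length];
        for (int i = 0; i < v.length; i++) {
            double s = 0.0;
            for (int j = 0; j < W.length; j++) s += h[j] * W[j][i];
            v[i] = sigmoid(s);
        }
        return v;
    }

    // Step 4: keep only the largest element as 1, set all others to 0.
    static double[] winnerTakesAll(double[] h) {
        double[] b = new double[h.length];
        int best = 0;
        for (int j = 1; j < h.length; j++) if (h[j] > h[best]) best = j;
        b[best] = 1.0;
        return b;
    }

    // Step 6: threshold at 0.5; several ones are possible.
    static double[] threshold(double[] v) {
        double[] b = new double[v.length];
        for (int i = 0; i < v.length; i++) b[i] = v[i] > 0.5 ? 1.0 : 0.0;
        return b;
    }

    public static void main(String[] args) {
        double[][] samples = {
            {1, 1, 0, 0, 0, 0, 1, 1},   // spread byte   11000011
            {0, 0, 1, 1, 1, 1, 0, 0}    // centered byte 00111100
        };
        double learningRate = 0.1;
        double[][] W = new double[2][8];
        for (double[] row : W)                                // step 1: random W
            for (int i = 0; i < row.length; i++) row[i] = RND.nextDouble() - 0.5;

        for (int pass = 0; pass < 1000; pass++) {             // step 9: repeat until stable
            for (double[] t : samples) {                      // step 2: pick a training column
                double[] h1 = winnerTakesAll(hidden(W, t));   // steps 3 and 4
                double[] v  = threshold(visible(W, h1));      // steps 5 and 6
                double[] h2 = hidden(W, v);                   // step 7
                for (int j = 0; j < W.length; j++)            // step 8: rank-one update
                    for (int i = 0; i < W[0].length; i++)
                        W[j][i] += learningRate * (h1[j] * t[i] - h2[j] * v[i]);
            }
        }
        for (double[] t : samples) {                          // print hidden activations
            double[] h = hidden(W, t);
            System.out.printf("%.2f %.2f%n", h[0], h[1]);
        }
    }
}

The final loop prints the two hidden activations for each training column; after training, each column is expected to respond mainly to its own hidden unit, approximately reproducing the class identity matrix (results may vary with the random start).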

I considered only a small, but critical for understanding, part of RBM. I recommend reading more. I avoided the standard terminology only to make RBM easier to understand; of course, I recommend using the standard terminology. The other interesting part of RBM (not addressed here) is unsupervised learning, when a large set of data is available but there is no training sample. I hope that after this elementary introduction the other materials about RBM will become clear to a reader new to the subject.

Is RBM really new?

When we have a training sample, the efficiency of the method lies in the coefficients of the weight matrix, so it is not clear why one should bother with RBM. We have a known binary matrix of training data and a known matrix of positive or negative ones indicating which class each individual column of the training data belongs to, as shown below. We can apply many utilities of linear algebra packages to find W, such as the pseudoinverse or the singular value decomposition, or elementary least squares with regularization to smooth the elements of W, among others.
Unknown weight matrix
w11 w12 ... w1n
w21 w22 ... w2n
...
wm1 wm2 ... wmn

Known binary training matrix
b11 b12 ... b1k
b21 b22 ... b2k
...
bn1 bn2 ... bnk

Known class indicators
 1 -1 -1 ... -1
...
-1 -1 -1 ...  1

When W is found, we can apply SIGMOID to the result and declare it an RBM.
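As one concrete recipe (a sketch of the least-squares route, not taken from the accompanying DEMO program): treat the task as the linear system W * B ≈ C, where B is the known n x k binary training matrix and C is the m x k matrix of class indicators above, and take the regularized least-squares solution W = C * B^T * (B * B^T + lambda * I)^(-1) with a small lambda > 0. Applying SIGMOID to W * B then gives class membership values between 0 and 1 for each training column.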


Andrew Polar, December 2017.