Elementary Restricted Boltzmann Machine and Contrastive Divergence algorithmWhen reading articles about restricted Boltzmann machines I found that its importance, efficiency and complexity is highly exaggerated by the authors. May be it is only my impression, but I saw it as very bad explained elementary technique. This article and java code is my attempt to explain it in readers friendly way.
Understanding the taskLet us consider all possible bytes with 4 positive bits. Assume we need to sort them into two conventional groups with centered bits 00111100 and spread apart bits 11000011. These two bytes are ideal cases of centered and spread bits, the others may not be an ideal case, we may see this byte 01011100 as rather centered than spread and this byte 11000101 as rather spread than centered. So we want a utility that examines each byte and, if byte has 4 positive bits, classify it as centered or spread by assigning measures of belonging to each class, for example, 80% centered and 20% spread.
Elementary logical solutionLet us consider the weight matrix W
Let us call our ideal centered and spread bytes as training sample and multiply it by weight matrix
The result can be interpreted as numerical characteristic of belonging the tested binary vector to a particular class. Positive number is "YES" and negative number is "NO". In order to make this result closer to our intuitive understanding they introduced function SIGMOID(x) = 1/(1 + exp(-x)). It maps every scalar to a scale (0,1). When we say we apply SIGMOID to a vector or matrix that means we apply it to its every element. After application of SIGMOID to result we get matrix
If we round result matrix it turns into identity matrix. If we make left multiplication of identity matrix with weight matrix and apply SIGMOID to result we'll get approximate training sample. So the following is true with stipulation that equalities are approximate
SIGMOID( W * T) = I
SIGMOID( I * W) = TT
where W is weight matrix, T is training sample, I is indicator of which class our sample belong to and superscript denotes a transpose operation.
In this example the elements of weight matrix were logically assigned, but they supposed to be built on collected data. We show how to do that in the next section.
Contrastive DivergenceWe can see that properly built weight matrix solved the problem of classification of training sample and therefore can be used for quantitative evaluation of other data. There are several algorithms that lead to building of weight matrix via multiple formal steps. One of them is Constructive Divergence.
Let us first examine our data. We have two matrices, which are training sample and class identity matrix
The standard terminology used in RBM is to call elements of training vectors as visible units, because we know them and elements of class identity matrix as hidden units, because they are unknown upfront. Anyone familiar with linear algebra can see the similarity with finding of left pseudoinverse T+ T = I , which is known for a century. I did not compare myself what is more effective contrastive divergence or pseudoinverse, I only point to strong similarity.
In our case we have 8 visible units, which may take values 0 or 1 and two hidden classes. We need to find weight matrix that computes numerical ratio of belonging the provided combination of values in visible units to a class. We know that sample 11000011 belong to class 1 and sample 00111100 belong to class 2.
The contrastive divergence algorithm is elementary:
I considered only small, but critical for understanding, part of RBM. I recommend to read more. I did not follow standard terminology only to make understanding of RBM easy. I recommend to use standard terminology, of course. The other interesting part of RBM (not addressed here) is unsupervised learning, when large set of data is available but there is no training sample. I hope after this elementary introduction the other materials about RBM will become clear for a new to the subject reader.
Is RBM really new?When we have training sample, the efficiency of the method lays in coefficients of the weight matrix. So it is not clear why to bother with RBM. We have known binary matrix of training data and we have known matrix with positive or negative ones indicating where each individual column of training data belong to, as it is shown below. We can apply many utilities of linear algebra packages to find W, such as finding pseudoinverse or singular value decomposition, or elementary least squares with regularization to smooth elements of W, or other.
When W is found we can apply SIGMOID to result and declare it RBM.
Andrew Polar, Dec, 2017.