Multimodal posterior distribution

When methods like Bayesian Neural Networks (BNN) and Divisive Data Resorting (DDR) return distribution parameters or samples of outputs for the inputs, not used in training, we wish to know how accurate they are. The experimental data or recorded observations of the physical systems usually have all different inputs, so we can't use them even to assess accuracies of returned expectations not speaking about distributions. For comparison we need so-called true or actual distributions, and the only way to get them is to use synthetic data. This generated data must be challenging enough to expose weaknesses and strengths of the models. The word challenging means that outputs should have not normal and not unimodal distributions which must vary significantly for the different inputs.

The simulating formua, answering such needs, is derived by the authors of this site:

where $C_j$ are uniformly distributed random values on $[0,1]$, $X_j$ are observed or given values, $X^*_j$ are unobserved values used in computation of the outputs $y$. Parameter $\delta$ is an error level, when it is 0.0, the system becomes deterministic (we use $\delta = 0.8$ in experiments).

Below are examples of two histograms for the different inputs:

The histograms are built by repeated recalculation of $y$, which defines the type of uncertainty as aleatoric. There is also a gradual change in the distributions of output depending on input. It means that for the small changes in observed values $X_j$, the changes in distribution of the output are also small. But distributions for significanlty different inputs can also be significantly different.

Bayesian Neural Network test

For BNN test we used published code sample provided by Keras as an example. The original data Wine quality, used in example, was replaced by 10,000 records generated by above formula. This slightly modified Python code can be found in author's repository.

The only change to the main code, that we had to do, was to address expected multimodality in posterior distributions. The following two operations of original code:

distribution_params = layers.Dense(units=2)(features)     
outputs = tfp.layers.IndependentNormal(1)(distribution_params)
narrow the output to a single normal distribution with the best fitted expectation and standard deviation. Assignment (units=2) defines the number of estimated parameters. This part was changed according to tensorFlow documentation into mixture of four normal distribtions with weights in the following way:

distribution_params = layers.Dense(units=12)(features)     
outputs = tfp.layers.DistributionLambda(lambda t:        
      cat = tfp.distributions.Categorical(logits=t[:, :4]),
      components = 
         tfp.distributions.Normal(loc=t[:, 4], scale=tf.math.softplus(t[:, 5])),
         tfp.distributions.Normal(loc=t[:, 6], scale=tf.math.softplus(t[:, 7])),
         tfp.distributions.Normal(loc=t[:, 8], scale=tf.math.softplus(t[:, 9])),
         tfp.distributions.Normal(loc=t[:, 10], scale=tf.math.softplus(t[:, 11]))
The number of estimated parameters becomes (units=12), four expectations, four standard deviations and four weights. TensorFlow library is very user unfriendly with very poor documentation. The proper explanation of this replacement and how it works needs multiple pages even for a person familiar with Python, so we skip that part, since it is out of the scope of this article.

The best explanation I read so far can be found on the following link, but even that is shallow and incomplete.

After training is completed, the model is tested by 100 unseen random inputs. It returns the sample of 1024 possible output values for each input. These samples are compared to conventionally true values, generated by Monte Carlo using the formula at the top. Since we compare 100 arrays to another 100 arrays, we need to define statistical indicators and accuracy metric. First set of values are expectations and variances. The accuracy metric is Pearson correlation coefficient, here is the table with results for eight executions of the code:

Pearson correlation for 100 unseen inputs
Variances 0.840.850.710.840.800.760.870.81

We can see a perfect match for expectations and not so good match for variances. However, the goal of BNN is to obtain the distribtions. For them we've chosen Kullback-Leibler (KL) divergence measure. We use both modelled and Monte Carlo samples for building 5 bar histograms, then we apply KL-divergence measure for each pair and compute mean value for 90% of best results as a single accuracy measure.

This choice may need some explanation. All records in the training sample have different inputs. When model produce the output sample for unseen input, it may not perfectly match. From the perspective of the user of the model, it is good enough to know certain statistical properties of the possible distribution, such as multimodality and skewness. That can be seen clearly on 5-bar histograms if they are accurate. KL-divergence for exact match is 0.0 but it can exceed 1.0 for random or inaccurate pairs, so few failed cases may spoil the measure for 100 KL values, and on that reason 90% of best results are considered and average value is shown in the table below. Eight values are computed for eight executions of the code.

Mean KL-divergence for the best 90% results

Accuracy near 5% is a very good match. Below we show two 5 bar historgrams with KL-divergence 0.05 for visualization

Two histograms with KL-divergence = 0.05

Direct modelling of variance

Predicting only expectations and variances by Bayesian Neural network, as it is in original code,

distribution_params = layers.Dense(units=2)(features)     
outputs = tfp.layers.IndependentNormal(1)(distribution_params)
is an obvious overengineering. These values can be estimated by straight-forward modelling with two deterministic models. The expectation values are obtained in a traditional way by a single deterministic model $M_E$ using minimization of residual errors $e_i$

$$e_i = [y_i - M_E(X^i)]^2.$$
Variances can be computed also by a single deterministic model $M_V$ for a new output values $v_i$
$$v_i = [y_i - M_E(X^i)]^2.$$ also by minimization of residuals
$$e_i = [v_i - M_V(X^i)]^2.$$
Below is a result of using such approach:

Pearson correlation for 100 unseen inputs for two models approach
Variances 0.960.980.970.970.980.980.970.98

The deterministic component for both models was Kolmogorov-Arnold representation (details for the training of this model can be found in published paper)

$$ M(x_1, x_2, x_3, ... , x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right).$$

We can see that two deterministic models provide significantly more accurate estimation for expectations and variances compared to Bayesian Neural Network (BNN). Each case when output (or target) of BNN is modelled as normal distribution can be simply replaced by two model solution.

Note that such accurate solution is obtained for non-trivial and very challenging data. Most of the published datasets are not that hard. When output distributions are simple bell shape curves and not significanlty depend on inputs, such elementary approach gives almost 100% accuracy. Anyone can easily verify this statement. The formula at the top can be modified into producing unimodal (not necessarily normal and symmetric) distributions, and the accuracy in prediction for expectations and variances, after this, raise to near 100%.

Divisive Data Resorting test

The deterministic component was Kolmogorov-Arnold representation. The theoretical explanation is on the previous page. The source code can be found in author's repository. The test results are shown in the table below.

Test results for 8 executions of DDR code
Expectations 0.990.990.990.990.990.990.990.99
Variances 0.950.960.950.970.930.930.970.93
KL divergences0.0320.0470.0370.0580.0610.0530.0660.047

The tests were conducted in the same way as for BNN, the accuracy metrics are the same.


On unknown to me reasons I could not find online even a single complete example for recognition of multimodal properties in posterior distributions for Keras or tensorFlow. All examples, I found, predict distribution of targets as a single normal. We don't need probabilistic programming to predict variance as an only characteristic of uncertainty. The searches for "mulitmodal posteriors" return long lists of theoretical articles with no clear answer or working code, although it is not even a problem, it only a choice of the function in Keras or tensorFlow representing top layer.

The example of Divisive Data Resorting, which works with about the same accuracy as BNN, is written as DEMO without code optimization for the fast execution for the purpose of clear demonstration of the concept. Even in the current form it is about 6 times faster than BNN and its concept allows parallel execution, so when applied, it can be 10 to 20 times faster. It has about 700 lines of C# code which uses only built in types without any additional special library. The concept of DDR allows using any deterministic model as an element of stochastic model, it may use neural network but not need it.