Multimodal posterior distribution
When methods like Bayesian Neural Networks (BNN) and Divisive Data Resorting (DDR) return distribution parameters
or samples of outputs for inputs not used in training, we wish to know how accurate they are. Experimental
data or recorded observations of physical systems usually have all-different inputs, so we cannot use them even
to assess the accuracy of the returned expectations, let alone the distributions. For comparison we need
so-called true or actual distributions,
and the only way to obtain them is to use synthetic data. The generated data must be challenging enough to expose the weaknesses
and strengths of the models. Challenging here means that the outputs should have non-normal, non-unimodal distributions
which vary significantly for different inputs.
The simulating formula answering these needs is derived by the authors of this site:

where $C_j$ are uniformly distributed random values on $[0,1]$, $X_j$ are observed or given values, and $X^*_j$
are unobserved values used in the computation of the outputs $y$. Parameter $\delta$ is the error level; when it is 0.0, the system becomes deterministic (we use $\delta = 0.8$ in the
experiments).
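As a purely illustrative sketch (not the actual simulating formula above), the following hypothetical generator combines the same ingredients, the observed $X_j$, the uniform $C_j$ and the error level $\delta$, so that repeated calls with a fixed input produce a multimodal, input-dependent spread:

import numpy as np

rng = np.random.default_rng(0)

def sample_y(x, delta=0.8, n_draws=1):
    # Hypothetical stand-in generator, NOT the formula above: unobserved uniform
    # draws C_j perturb the observed inputs and switch between branches, so the
    # output y is random and multimodal for a fixed x when delta > 0.
    x = np.asarray(x, dtype=float)
    c = rng.uniform(0.0, 1.0, size=(n_draws, x.size))    # unobserved C_j
    x_star = x + delta * (c - 0.5)                        # unobserved X*_j
    y = np.sin(x_star).sum(axis=1) + delta * np.sign(c[:, 0] - 0.5) * x.mean()
    return y

# Repeated recalculation for one fixed input gives the aleatoric histogram
counts, edges = np.histogram(sample_y([0.3, 0.7, 0.1], n_draws=10_000), bins=50)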
Below are examples of two histograms for two different inputs:


The histograms are built by repeated recalculation of $y$, which makes the type of uncertainty aleatoric. There is also
a gradual change in the distribution of the output depending on the input: for small changes in the observed values $X_j$,
the changes in the distribution of the output are also small, but distributions for significantly different inputs can differ
significantly.
Bayesian Neural Network test
For the BNN test we used a published code sample provided by
Keras as an example.
The original Wine quality data used in the
example was replaced by
10,000 records generated by the formula above. The slightly modified Python code can be found
in the author's repository.
The only change we had to make to the main code was to address the expected multimodality in the posterior distributions. The following two
operations of the original code:
distribution_params = layers.Dense(units=2)(features)
outputs = tfp.layers.IndependentNormal(1)(distribution_params)
narrow the output to a single normal distribution with the best-fitted expectation and standard deviation. The assignment
(units=2) defines the number of estimated parameters. Following the TensorFlow documentation, this part was changed into a
mixture of four normal distributions with weights in the following way:
distribution_params = layers.Dense(units=12)(features)
outputs = tfp.layers.DistributionLambda(
    lambda t: tfp.distributions.Mixture(
        # the first 4 of the 12 parameters are the mixture weights (as logits)
        cat=tfp.distributions.Categorical(logits=t[:, :4]),
        # the remaining 8 are (mean, raw scale) pairs; softplus keeps the scale positive
        components=[
            tfp.distributions.Normal(loc=t[:, 4], scale=tf.math.softplus(t[:, 5])),
            tfp.distributions.Normal(loc=t[:, 6], scale=tf.math.softplus(t[:, 7])),
            tfp.distributions.Normal(loc=t[:, 8], scale=tf.math.softplus(t[:, 9])),
            tfp.distributions.Normal(loc=t[:, 10], scale=tf.math.softplus(t[:, 11]))
        ]
    )
)(distribution_params)
The number of estimated parameters becomes (units=12): four expectations, four standard deviations and four weights.
The TensorFlow library is very user-unfriendly, with very poor documentation. A proper explanation of this replacement
and how it works would need multiple pages even for a person familiar with Python, so we skip that part, since it is
out of the scope of this article.
The best explanation I have read so far can be found at the following
link,
but even that is shallow and incomplete.
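Still, as a minimal standalone sketch of how the 12 values map onto the mixture, the same distribution can be built directly with tfp.distributions outside the Keras model (the batch of raw parameters here is random, purely for illustration):

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# hypothetical batch of 3 records, each with 12 raw parameters as produced by Dense(units=12)
t = tf.random.normal(shape=(3, 12))

mix = tfd.Mixture(
    cat=tfd.Categorical(logits=t[:, :4]),    # 4 mixture weights given as logits
    components=[
        tfd.Normal(loc=t[:, 4], scale=tf.math.softplus(t[:, 5])),
        tfd.Normal(loc=t[:, 6], scale=tf.math.softplus(t[:, 7])),
        tfd.Normal(loc=t[:, 8], scale=tf.math.softplus(t[:, 9])),
        tfd.Normal(loc=t[:, 10], scale=tf.math.softplus(t[:, 11]))
    ]
)

samples = mix.sample(1024)   # shape (1024, 3): 1024 draws per record
means = mix.mean()           # analytic mixture mean per record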
After training is completed, the model is tested on 100 unseen random inputs. It returns a sample of 1024 possible output values for each input.
These samples are compared to conventionally true values generated by Monte Carlo using the formula at the top. Since we compare 100
arrays to another 100 arrays, we need to define statistical indicators and an accuracy metric. The first set of values are the expectations
and the variances; the accuracy metric is the Pearson correlation coefficient. Here is the table with results for eight executions of the code:
Pearson correlation for 100 unseen inputs
Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Expectations | 0.98 | 0.97 | 0.98 | 0.98 | 0.96 | 0.98 | 0.98 | 0.98
Variances | 0.84 | 0.85 | 0.71 | 0.84 | 0.80 | 0.76 | 0.87 | 0.81
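The two rows above can be computed with a short sketch like the following, assuming the 1024-value samples for the 100 unseen inputs are stacked into hypothetical (100, 1024) arrays model_samples and mc_samples:

import numpy as np
from scipy.stats import pearsonr

def correlation_report(model_samples, mc_samples):
    # Pearson correlation between per-input expectations and between per-input variances
    r_mean, _ = pearsonr(model_samples.mean(axis=1), mc_samples.mean(axis=1))
    r_var, _ = pearsonr(model_samples.var(axis=1), mc_samples.var(axis=1))
    return r_mean, r_var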
We can see a near-perfect match for the expectations and a not-so-good match for the variances. However, the goal of BNN is to obtain the distributions.
For them we have chosen the Kullback-Leibler (KL) divergence measure. We use both the modelled and the Monte Carlo samples to build 5-bar histograms,
then we apply the KL-divergence measure to each pair and compute the mean value of the best 90% of results as a single accuracy measure.
This choice may need some explanation. All records
in the training sample have different inputs. When the model produces the output sample for an unseen input, it may not match perfectly. From
the perspective of the user of the model, it is good enough to know certain statistical properties of the possible distribution,
such as multimodality and skewness. These can be seen clearly on 5-bar histograms if they are accurate. The KL-divergence for an exact
match is 0.0, but it can exceed 1.0 for random or inaccurate pairs, so a few failed cases may spoil the measure over 100 KL values;
for that reason only the best 90% of results are considered and their average value is shown in the table below. Eight values are computed for
eight executions of the code.
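A sketch of this metric is shown below; the shared binning range, the smoothing constant and the direction of the divergence (true distribution against the modelled one) are our assumptions:

import numpy as np

def kl_5bar(mc_sample, model_sample, eps=1e-6):
    # KL divergence between 5-bar histograms built on a shared range
    lo = min(mc_sample.min(), model_sample.min())
    hi = max(mc_sample.max(), model_sample.max())
    edges = np.linspace(lo, hi, 6)
    p, _ = np.histogram(mc_sample, bins=edges)
    q, _ = np.histogram(model_sample, bins=edges)
    p = p / p.sum() + eps        # smooth empty bars to avoid log(0)
    q = q / q.sum() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def mean_best_90(kl_values):
    # average of the smallest 90% of the 100 KL values
    kl = np.sort(np.asarray(kl_values))
    return float(kl[: int(0.9 * len(kl))].mean())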
Mean KL-divergence for the best 90% of results
Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
KL-divergence | 0.029 | 0.037 | 0.070 | 0.031 | 0.060 | 0.077 | 0.029 | 0.040
A mean KL-divergence near 0.05 is a very good match. Below we show two 5-bar histograms with KL-divergence 0.05 for visualization.
Two histograms with KL-divergence = 0.05
Direct modelling of variance
Predicting only expectations and variances with a Bayesian Neural Network, as in the original code,
distribution_params = layers.Dense(units=2)(features)
outputs = tfp.layers.IndependentNormal(1)(distribution_params)
is obvious overengineering. These values
can be estimated by straightforward modelling with two deterministic models.
The expectation values are obtained in the traditional way by a single deterministic model $M_E$ using minimization of the residual errors $e_i$
$$e_i = [y_i - M_E(X^i)]^2.$$
The variances can be computed, also by a single deterministic model $M_V$, for the new output values $v_i$
$$v_i = [y_i - M_E(X^i)]^2,$$
again by minimization of the residuals
$$e_i = [v_i - M_V(X^i)]^2.$$
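A minimal sketch of this two-model approach is given below; the regressor is a generic stand-in with arbitrary settings (the actual deterministic component is the Kolmogorov-Arnold model described next):

import numpy as np
from sklearn.neural_network import MLPRegressor   # stand-in deterministic model

def fit_expectation_and_variance(X, y):
    # M_E predicts the expectation; M_V predicts the variance via the new targets v_i
    M_E = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, y)
    v = (y - M_E.predict(X)) ** 2                  # v_i = [y_i - M_E(X^i)]^2
    M_V = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000).fit(X, v)
    return M_E, M_V

# Usage: expectation = M_E.predict(X_new), variance estimate = M_V.predict(X_new)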
Below is the result of using this approach:
Pearson correlation for 100 unseen inputs for the two-model approach
Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Expectations | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99
Variances | 0.96 | 0.98 | 0.97 | 0.97 | 0.98 | 0.98 | 0.97 | 0.98
The deterministic component for both models was the Kolmogorov-Arnold representation
(details of the training of this model can be found in the
published paper)
$$ M(x_1, x_2, x_3, ... , x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right).$$
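The nested structure of this representation can be sketched as follows; the one-dimensional functions $\phi_{q,p}$ and $\Phi_q$ are passed in as callables, and how they are parameterized and trained follows the cited paper, not this sketch:

def kolmogorov_arnold(x, inner, outer):
    # M(x) = sum over q of Phi_q( sum over p of phi_{q,p}(x_p) ),
    # where inner[q][p] and outer[q] are one-dimensional callables
    n = len(x)
    return sum(
        outer[q](sum(inner[q][p](x[p]) for p in range(n)))
        for q in range(2 * n + 1)
    )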
We can see that the two deterministic models provide a significantly more accurate estimation of the expectations and variances compared to the Bayesian Neural Network (BNN).
Each case where the output (or target) of a BNN is modelled as a normal distribution can simply be replaced by the two-model solution.
Note that this accurate solution is obtained for non-trivial and very challenging data. Most of the published datasets are not that hard. When the output distributions
are simple bell-shaped curves and do not depend significantly on the inputs, this elementary approach gives almost 100% accuracy. Anyone can easily verify this statement.
The formula at the top can be modified to produce unimodal (not necessarily normal or symmetric) distributions, and the accuracy in predicting
expectations and variances then rises to near 100%.
Divisive Data Resorting test
The deterministic component was the Kolmogorov-Arnold representation. The theoretical explanation is on the
previous page.
The source code can be found in the
author's repository. The test results are shown in the table below.
Test results for 8 executions of the DDR code
Run | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Expectations | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.99
Variances | 0.95 | 0.96 | 0.95 | 0.97 | 0.93 | 0.93 | 0.97 | 0.93
KL divergences | 0.032 | 0.047 | 0.037 | 0.058 | 0.061 | 0.053 | 0.066 | 0.047
The tests were conducted in the same way as for BNN, and the accuracy metrics are the same.
Conclusion
For reasons unknown to me, I could not find online even a single complete example of recognizing multimodal properties in posterior distributions for Keras or TensorFlow.
All examples I found predict the distribution of targets as a single normal. We don't need probabilistic programming to predict variance as the only characteristic of uncertainty.
Searches for "multimodal posteriors" return long lists of theoretical articles with no clear answer or working code, although it is not even a problem;
it is only a choice of the function in Keras or TensorFlow representing the top layer.
The example of Divisive Data Resorting, which works with about the same accuracy as BNN, is written as a DEMO, without code optimization for fast
execution, for the purpose of clearly demonstrating the concept. Even in its current form it is about 6 times faster than BNN, and its concept allows parallel execution,
so when applied it can be 10 to 20 times faster. It has about 700 lines of C# code which uses only built-in types without any additional special library.
The concept of DDR allows using any deterministic model as an element of a stochastic model; it may use a neural network but does not need one.