What is wrong with Bayesian Neural Networks?
The concept of a Bayesian neural network (BNN) is simple and straightforward. The weights and biases
of the network become random variables with a normal prior distribution,
and the posterior distribution is then derived by Bayesian inference.
The parameters of the found posterior distribution are obtained from the training data and allow
generating output samples for new or unseen inputs. The end goal of this approach is
clearly to address uncertainty, which is assumed to be epistemic,
aleatoric, or both.
Epistemic uncertainty
The best way to explain it is by example. Assume we need a model for computing the area of a triangle
given the coordinates of its vertices $x_1, y_1, x_2, y_2, x_3, y_3$.
Anyone who has tried to train a deep neural network on this data has found, surprisingly, that it is a very challenging
task. For training sets with about 2000 records the average error is as large as 20%. An accurate
model can be built only for very large data sets, such as 100 000 records. That is an example of epistemic uncertainty:
the modelled phenomenon is deterministic and the data are exact, but neural networks are not perfect enough.
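The deterministic target in this example can be generated exactly with the shoelace formula. A minimal sketch of such a training set (the function and variable names here are my own, not from the text):

```python
import numpy as np

def triangle_area(v):
    # Shoelace formula: exact area from vertex coordinates (x1, y1, x2, y2, x3, y3)
    x1, y1, x2, y2, x3, y3 = v
    return 0.5 * abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(2000, 6))   # 2000 records, as in the text
y = np.array([triangle_area(row) for row in X])
```

Since the mapping is exact and noise-free, any remaining model error on this data is purely epistemic.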
Aleatoric uncertainty
An example of aleatoric uncertainty is polling results depending on demographic data: for example, a voting
decision as a function of several parameters such as age, sex, income, education, marital status, home ownership and others.
It is clear that records with the same inputs may result in different outputs, but the underlying probabilities may be
stable and input dependent.
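A toy sketch of this kind of aleatoric data (all parameter values below are illustrative, not from the text): the output is a random vote, but its probability is a fixed function of the inputs:

```python
import numpy as np

rng = np.random.default_rng(1)

def vote_probability(age, income):
    # Illustrative: probability of voting "yes" depends smoothly on the inputs
    return 1.0 / (1.0 + np.exp(-(0.05 * (age - 40) + 0.5 * (income - 2.0))))

# Two records with identical inputs can still produce different outputs,
# but the underlying probability is stable and input dependent.
age, income = 50.0, 3.0
p = vote_probability(age, income)
votes = rng.binomial(1, p, size=10000)
print(p, votes.mean())   # the empirical frequency approaches p
```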
Unfulfilled promise
The main goal of a Bayesian neural network is to model uncertainty by returning a distribution of the output instead of a point value.
That means that if we manage to obtain data with known input dependent distributions, like in the table below:
| Inputs | Sample of possible outputs |
| --- | --- |
| $x_{1,1}, x_{1,2}, \ldots, x_{1,n}$ | $y_{1,1}, y_{1,2}, \ldots, y_{1,m}$ |
| $x_{2,1}, x_{2,2}, \ldots, x_{2,n}$ | $y_{2,1}, y_{2,2}, \ldots, y_{2,m}$ |
| ... | ... |
and provide only one output value for each record, with no identical inputs in the entire training set, the BNN is nonetheless
expected to identify the input dependent distributions.
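This setting can be sketched as follows (an illustrative generator, not the challenge formula from the last section): each row gets exactly one output, drawn from a distribution whose shape depends on the inputs:

```python
import numpy as np

rng = np.random.default_rng(2)

n = 5000
X = rng.uniform(0.0, 1.0, size=(n, 3))       # in practice no two rows are identical

# Illustrative input-dependent distribution: exponential, with a rate
# determined by the inputs, so the conditional distribution is asymmetric.
rate = 1.0 + 4.0 * X[:, 0] * X[:, 1]
y = rng.exponential(1.0 / rate)              # exactly one output per record

# A BNN is expected to recover, for a new input, the whole conditional
# distribution of y, not just its mean.
```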
Unfortunately, BNN capabilities are very far from achieving this goal.
Almost every reader can point to an article where a complex input dependent distribution was successfully modelled by a BNN.
However, in all these cases the researchers knew the actual distributions upfront
and kept tuning their models and identification procedures until obtaining the wanted result, good enough for publishing in a
journal.
This is not how the real world functions. Published articles are then used not by experts in the field but by qualified engineers who
either find and use available libraries or write their own code based on the open-source examples provided. Their data sets
have only one output per record, and the engineers have no idea what the distribution of the outputs might be. The validation
samples must use inputs unseen during training, and the actual distributions in the validation set must not be trivial bell-shaped
curves but possibly Poisson, exponential, multimodal or other.
So a proof of concept should be conducted in the following way: one engineer who was not involved in the development of the
libraries uses them on data generated by another engineer. The validation should be conducted on inputs for which the actual
distributions are available but hidden from the person who conducts the modelling. Also, the tested posterior distributions should
be neither normal nor even unimodal.
One angry data set for a challenge
Here is one data generation formula for those who disagree with this pessimistic assessment of BNN capabilities. It was derived by the
mathematician Mike Poluektov.
Here $C_j$ are uniformly distributed random variables on $[0,1]$, $X_j$ are the observed inputs,
$X^*_j$ are the values used in the computation of $y$, and the parameter $\delta$ defines the level of aleatoric uncertainty. The distributions of the outputs are all different
and input dependent. When $\delta = 0$ the model becomes deterministic, and valid neural networks usually provide near 100% accuracy.
For $\delta = 0.8$ deterministic models provide accuracy near 75%, which is close to many publicly available real-life observations. The accuracy
mentioned here is simply the Pearson correlation coefficient between the given outputs $y$ and the estimated outputs $\hat{y}$.
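Accuracy in this Pearson sense can be computed directly; a minimal sketch (the sample values are hypothetical):

```python
import numpy as np

def pearson_accuracy(y, y_hat):
    # Pearson correlation coefficient between targets and model estimates
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.corrcoef(y, y_hat)[0, 1]

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])      # hypothetical model outputs
print(pearson_accuracy(y, y_hat))
```

Note that this metric measures only how well the point estimates track the targets; it says nothing about whether the predicted distributions are correct.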
Most of the distributions are asymmetric, some are Poisson-like, some are multimodal (with multiple peaks), and some are monotonic. It is, however,
possible to train some probabilistic models to recognize input dependent distributions even for this angry formula, and some of these
examples are provided on this site.
Other pages with constructive criticism of BNN
It was really hard to find any criticism of BNNs. This is what I have found so far.