
What is wrong with Bayesian Neural Networks?

The concept of a Bayesian neural network (BNN) is simple and straightforward. The weights and biases of the network become random variables with normal prior distributions, and the posterior distribution is then derived by Bayesian inference. The parameters of the found posterior are obtained from the training data and allow generating output samples for new or unseen inputs. The end goal of this approach is clearly to address uncertainty, which is assumed to be epistemic, aleatoric, or both.
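
To make the idea concrete, here is a minimal sketch of the prediction side only, assuming a factorized Gaussian posterior over the weights has already been fitted; all parameter values and the tiny architecture are illustrative placeholders, not any particular library's API.

```python
import numpy as np

# Minimal sketch of BNN prediction, assuming a mean-field Gaussian posterior
# over weights has already been fitted. All numbers below are placeholders.
rng = np.random.default_rng(0)

# Posterior parameters for a 1-hidden-layer network: 3 inputs -> 8 hidden -> 1 output
W1_mu, W1_sigma = rng.normal(size=(3, 8)), 0.1 * np.ones((3, 8))
b1_mu, b1_sigma = np.zeros(8), 0.1 * np.ones(8)
W2_mu, W2_sigma = rng.normal(size=(8, 1)), 0.1 * np.ones((8, 1))
b2_mu, b2_sigma = np.zeros(1), 0.1 * np.ones(1)

def sample_forward(x):
    """Draw one set of weights from the posterior and run the network once."""
    W1 = rng.normal(W1_mu, W1_sigma)
    b1 = rng.normal(b1_mu, b1_sigma)
    W2 = rng.normal(W2_mu, W2_sigma)
    b2 = rng.normal(b2_mu, b2_sigma)
    h = np.tanh(x @ W1 + b1)
    return (h @ W2 + b2).item()

x_new = np.array([0.2, -1.0, 0.5])                    # an unseen input
samples = [sample_forward(x_new) for _ in range(1000)]
print(np.mean(samples), np.std(samples))              # a predictive distribution, not a point value
```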

Epistemic uncertainty

The best way to explain it is with an example. Assume we need a model for computing the area of a triangle given the coordinates of its vertices $x_1, y_1, x_2, y_2, x_3, y_3$.
Anyone who has tried to train a deep neural network on such data is surprised to find that it is a very challenging task. For training sets with 2000 records the average error is as large as 20%. An accurate model can be built only for very large data sets, such as 100,000 records. That is an example of epistemic uncertainty: the modelled phenomenon is deterministic and the data are exact, but the neural network is not perfect enough.
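
A sketch of how such an experiment could be set up is below; the network size and training settings are illustrative and not the exact ones behind the error figures quoted above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Generate triangles and their exact areas (shoelace formula), then fit a
# plain MLP. Architecture and settings are illustrative placeholders.
rng = np.random.default_rng(1)
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, 6))        # x1, y1, x2, y2, x3, y3

def area(v):
    x1, y1, x2, y2, x3, y3 = v.T
    return 0.5 * np.abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))

y = area(X)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X[:1600], y[:1600])
pred = model.predict(X[1600:])
rel_err = np.mean(np.abs(pred - y[1600:]) / np.maximum(y[1600:], 1e-6))
print(f"mean relative error: {rel_err:.1%}")
```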

Aleatoric uncertainty

An example of aleatoric uncertainty is polling results depending on demographic data: for example, a voting decision as a function of several parameters such as age, sex, income, education, marital status, home ownership and others. Records with identical inputs may result in different outputs, but the probabilities may be stable and input dependent.
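
A toy generator of such data might look as follows; the features and coefficients are made up purely to show stable, input-dependent outcome probabilities.

```python
import numpy as np

# Illustrative aleatoric data: identical demographic inputs can map to
# different voting outcomes, but the underlying probability is stable and
# input dependent. All features and coefficients are invented for the sketch.
rng = np.random.default_rng(2)
n = 10_000
age = rng.integers(18, 90, size=n)
income = rng.lognormal(mean=10.5, sigma=0.6, size=n)
education = rng.integers(0, 4, size=n)          # coded education level 0..3

logit = -3.0 + 0.03 * age + 0.2 * education - 0.02 * (income / 1000)
p_vote = 1.0 / (1.0 + np.exp(-logit))           # stable, input-dependent probability
vote = rng.binomial(1, p_vote)                  # observed outcome varies per record
```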

Unfulfilled promise

The main goal of a Bayesian neural network is to model uncertainty by returning a distribution of the output instead of a point value. That means that if we obtain some data with known input-dependent distributions, like in the table below:

| Inputs | Sample of possible outputs |
| --- | --- |
| $x_{1,1}, x_{1,2}, ... x_{1,n}$ | $y_{1,1}, y_{1,2}, ... y_{1,m}$ |
| $x_{2,1}, x_{2,2}, ... x_{2,n}$ | $y_{2,1}, y_{2,2}, ... y_{2,m}$ |
| ... | ... |

and provide only one output value for each record, with no identical inputs in the entire training set, a BNN is nonetheless expected to identify the input-dependent distributions. Unfortunately, BNN capabilities are very far from achieving this goal.
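
The following sketch shows the kind of training set described above: every input is unique, each record carries a single output, yet the output is drawn from a distribution whose shape depends on the input. The particular distributions are illustrative.

```python
import numpy as np

# One output per record, no repeated inputs, input-dependent output shape.
rng = np.random.default_rng(3)
n = 5000
X = rng.uniform(0.0, 1.0, size=(n, 2))

def draw_output(x):
    # Bimodal for small x[0], exponential-like otherwise; shape depends on the input.
    if x[0] < 0.5:
        mode = rng.choice([-1.0, 1.0])
        return mode * (1.0 + x[1]) + rng.normal(0.0, 0.1)
    return rng.exponential(scale=0.5 + x[1])

y = np.array([draw_output(x) for x in X])     # exactly one output value per record
```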

Almost every reader can point to an article where a complex input-dependent distribution was successfully modelled by a BNN. However, in all these cases the researchers knew the actual distributions upfront and kept tuning their models and identification procedures until obtaining the desired result, good enough for publishing in a journal.

This is not how the real world functions. The published articles are then used not by experts in the field but by qualified engineers, who either find and use available libraries or write their own code based on the open source provided in the examples. Their data sets have only one output per record, and the researchers have no idea what the distribution of the outputs might be. The validation samples must use inputs unused in the training process, and the actual distributions in the validation set must not be trivial bell-shaped curves but possibly Poisson, exponential, multimodal or other.

So a proof of concept should be conducted in the following way: one engineer, who was not involved in the development of the libraries, uses them on data generated by another engineer. The validation should be conducted on inputs for which the actual distributions are available but hidden from the person who conducts the modelling. Also, these tested posterior distributions should not be normal or even unimodal.
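
One possible form of that blind check, as a sketch: the data generator keeps a large sample from the true (non-normal) output distribution hidden, the modeller submits only samples from the fitted BNN, and the two are compared with a two-sample test. The arrays below are placeholders for those two sets of samples.

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder arrays: `hidden_truth` is known only to the data generator,
# `bnn_samples` stands in for the predictive samples submitted by the modeller.
rng = np.random.default_rng(4)
hidden_truth = rng.exponential(scale=1.0, size=5000)
bnn_samples = rng.normal(loc=1.0, scale=1.0, size=5000)

stat, p_value = ks_2samp(bnn_samples, hidden_truth)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3g}")   # large statistic -> distributions differ
```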

One angry data set for a challenge

Here is one data generation formula for those who disagree with this pessimistic assessment of BNN capabilities. It was derived by the mathematician Mike Poluektov.


where $C_j$ are uniformly distributed random variables on $[0,1]$, $X_j$ are the observed inputs, $X^*_j$ are the values used in the computation of $y$, and the parameter $\delta$ defines the level of aleatoric uncertainty. The distributions of the outputs are all different and input dependent. When $\delta = 0$ the model becomes deterministic, and valid neural networks usually provide near 100% accuracy. For $\delta = 0.8$ deterministic models provide accuracy near 75%, which is close to many publicly available real-life observations. The accuracy mentioned here is simply the Pearson correlation coefficient between the given outputs $y$ and the estimates $\hat{y}$. Most of the distributions are asymmetric, some are Poisson-like, some are multimodal (multiple peaks), and some are monotonic. It is, however, possible to train some probabilistic models to recognize input-dependent distributions even for this angry formula, and some of these examples are provided on this site.
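
For reference, the accuracy figure used above is just the Pearson correlation between the true and estimated outputs; a minimal way to compute it, with placeholder arrays:

```python
import numpy as np

# "Accuracy" as used above: Pearson correlation between y and y_hat.
def pearson_accuracy(y, y_hat):
    return np.corrcoef(np.asarray(y), np.asarray(y_hat))[0, 1]

# Illustrative usage with placeholder values:
y = np.array([1.0, 2.0, 3.0, 4.5, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 4.3, 5.2])
print(pearson_accuracy(y, y_hat))
```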

Other pages with constructive criticism of BNN

It was really hard to find any criticism of BNN. This is what I have found so far.