



Second failed benchmark

Published benchmark examples usually have two components: code and data. The data should be challenging enough to expose the strengths of the code, but it is not always chosen that way. This example is taken from Keras, another large and well-maintained collection of source code for deep learning.

The data used in this Keras example is Wine Quality. It has 4,898 records with 11 observed features. The target is a quality score 0, 1, ..., 10. Since this data is experimental and only one output value is available for each record, the predicted distribution can't be compared to the actual one, so accuracy can't be assessed. To make that assessment possible, I replaced the Wine Quality data with synthetic data:


where $C_j$ are uniformly distributed random values on $[0,1]$, $X_j$ are the observed values, $X^*_j$ are the values used in computing the outputs $y$, and the parameter $\delta$ is the error level. When $\delta=0$, the system becomes deterministic and can be modelled by a neural network with nearly 100% accuracy, so we are dealing with aleatoric uncertainty only. The generated data set contained 10,000 records.

This formula was designed by the mathematician Mike Poluektov; I call it Mike's benchmark data set. Recalculating the outputs with different random terms $C_j$ allows estimation of the probability densities of $y$, all of which are sufficiently complex and depend on the observed inputs $X_j$.
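The recalculation idea can be sketched in a few lines. Everything specific here is an assumption made for illustration: `toy_system` is a hypothetical stand-in and not the benchmark formula, and the perturbation $X^*_j = X_j + \delta (C_j - 0.5)$ is only one plausible way to inject the uniform terms. The sketch shows the mechanics only: hold the observed inputs fixed, redraw $C_j$ many times, and summarize the resulting outputs.

```python
import numpy as np

def toy_system(x_star):
    # Hypothetical stand-in for the benchmark formula (NOT the formula from the text).
    return np.sin(3.0 * x_star[..., 0]) + x_star[..., 1] * x_star[..., 2]

def monte_carlo_stats(x, delta, n_draws=2000, seed=0):
    """Estimate the mean and standard deviation of y for each observed row of x."""
    rng = np.random.default_rng(seed)
    # One uniform draw on [0, 1] per repetition, record and feature.
    c = rng.uniform(0.0, 1.0, size=(n_draws,) + x.shape)
    # Assumed perturbation: observed value shifted by centred uniform noise.
    x_star = x + delta * (c - 0.5)
    y = toy_system(x_star)  # shape (n_draws, n_records)
    return y.mean(axis=0), y.std(axis=0)

rng = np.random.default_rng(1)
x_obs = rng.uniform(0.0, 1.0, size=(5, 3))  # five observed records, three features
mc_mean, mc_std = monte_carlo_stats(x_obs, delta=0.8)
print(mc_mean)
print(mc_std)
```

A histogram of the columns of `y` would give the per-record density estimate; the mean and standard deviation are the two summaries that matter for the comparison in this section.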

The Keras code provides only expectations and standard deviations. So I replaced the experimental Wine Quality data and compared the expectations and standard deviations returned by the model to the so-called 'actual', that is Monte Carlo simulated, values.

The error level was $\delta = 0.8$. A single deterministic model for such data gives accuracy near $75\%$. The accuracy for the returned expectations was near $98\%$ and for the standard deviations near $92\%$. The accuracy metric used was the Pearson correlation coefficient. The modified code can be found in my repository backup location; it is almost the same as the original, with only the data replaced and an accuracy assessment added.
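For reference, the Pearson correlation coefficient used as the accuracy metric can be computed as in the minimal sketch below. The function name `pearson` and the sample numbers are illustrative assumptions, not values from the experiment.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float(np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2)))

# Made-up numbers, only to show the call: model output vs. Monte Carlo reference.
model_std = [0.21, 0.35, 0.18, 0.40, 0.29]
reference_std = [0.20, 0.33, 0.22, 0.41, 0.27]
print(pearson(model_std, reference_std))
```

Note that this metric is insensitive to scale and offset: a model whose standard deviations are all twice too large still scores 1.0, so a high correlation does not by itself show that the predicted values are calibrated.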

Is it success or failure?

It may look like a good result. The data is very complex and a single deterministic model is very inaccurate: the correlation between its estimated outputs $\hat{y}$ and the given outputs $y$ is only $75\%$. The BNN, however, returns expectations with $98\%$ accuracy and standard deviations with $92\%$ accuracy.

In the next section I explain why this result is in fact very weak.

'Vandalization' of the beautiful picture

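My idea for a counterexample was to model the variance with another deterministic model and compare its accuracy to that of the BNN. Assume we have built a single deterministic expectation model $M_E$ by minimizing residual errors, and it provides the estimated outputs $\hat{y}_i = M_E(X_i)$. Now we can compute the squared error for each individual record, $e_i = (M_E(X_i) - y_i)^2$, and build another model $M_V$ using $e_i$ as the new targets. Since $M_V$ is trained to predict the average squared residual for a given input, its output is an estimate of the conditional variance of $y$.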
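The two-model procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the project: the data is a hypothetical one-dimensional heteroscedastic toy problem, and ordinary polynomial least squares stands in for both $M_E$ and $M_V$.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
n = 20000

# Hypothetical toy data: the noise level grows with x (assumption for illustration).
x = rng.uniform(0.0, 1.0, n)
true_mean = np.sin(2.0 * np.pi * x)
true_std = 0.2 + 0.5 * x
y = true_mean + true_std * rng.standard_normal(n)

# Step 1: expectation model M_E, fitted by minimizing squared residuals.
m_e = Polynomial.fit(x, y, deg=7)

# Step 2: squared residuals e_i = (M_E(x_i) - y_i)^2 become the new targets.
e = (m_e(x) - y) ** 2

# Step 3: variance model M_V, fitted to the squared residuals.
m_v = Polynomial.fit(x, e, deg=3)

pred_mean = m_e(x)
# Clip at zero because a least-squares fit can dip slightly below zero.
pred_std = np.sqrt(np.clip(m_v(x), 0.0, None))

corr_mean = np.corrcoef(pred_mean, true_mean)[0, 1]
corr_std = np.corrcoef(pred_std, true_std)[0, 1]
print(corr_mean, corr_std)
```

Any regression model can take the place of the polynomials. The only structural requirement is that $M_V$ is trained on the squared residuals of an already fitted $M_E$.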

The models $M_E$ and $M_V$ do not have to be neural networks. Here I chose the one I have been using for several years in research, $$ M(x_1, x_2, x_3, ... , x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_{p})\right), $$ which is the Kolmogorov-Arnold representation. This model will be explained in detail elsewhere on this site. For the moment I only state that in many comparison tests its accuracy was close to that of neural networks, while training takes much less time. The functions $\Phi_q, \phi_{q,p}$ are not specified prior to training; their shapes are fully determined in the training process.
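To make the structure of the formula concrete, here is a sketch of how such a model could be evaluated once its functions are known. The piecewise-linear parameterization on fixed grids, the function name `ka_forward` and the array layout are all assumptions made for this illustration; the sketch covers evaluation only, not training.

```python
import numpy as np

def ka_forward(x, inner_grid, inner_vals, outer_grid, outer_vals):
    """Evaluate M(x) = sum_q Phi_q( sum_p phi_{q,p}(x_p) ).

    x          : (n_records, n) inputs
    inner_grid : (k_in,) node positions shared by all inner functions phi_{q,p}
    inner_vals : (2n+1, n, k_in) node values of phi_{q,p}
    outer_grid : (k_out,) node positions shared by all outer functions Phi_q
    outer_vals : (2n+1, k_out) node values of Phi_q
    """
    n_outer, n = inner_vals.shape[:2]
    out = np.zeros(x.shape[0])
    for q in range(n_outer):
        inner_sum = np.zeros(x.shape[0])
        for p in range(n):
            # Piecewise-linear phi_{q,p}; np.interp holds the end values outside the grid.
            inner_sum += np.interp(x[:, p], inner_grid, inner_vals[q, p])
        # Piecewise-linear Phi_q applied to the inner sum.
        out += np.interp(inner_sum, outer_grid, outer_vals[q])
    return out

rng = np.random.default_rng(0)
n = 2                                        # number of inputs
inner_grid = np.linspace(0.0, 1.0, 6)
outer_grid = np.linspace(-3.0, 3.0, 8)
inner_vals = rng.normal(size=(2 * n + 1, n, inner_grid.size))  # random, untrained shapes
outer_vals = rng.normal(size=(2 * n + 1, outer_grid.size))
x = rng.uniform(0.0, 1.0, size=(4, n))
print(ka_forward(x, inner_grid, inner_vals, outer_grid, outer_vals))
```

In this parameterization the trainable quantities are the node values `inner_vals` and `outer_vals`, which matches the statement that the shapes of the functions are determined entirely by training.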

Now I simply announce the end result for the Kolmogorov-Arnold model: the accuracy is 98% for the expectation and 92% for the variance. Two deterministic models did the same job. The code can be found in my storage; the project name is residual.

Using two deterministic models for uncertainty estimation is obviously a much quicker and more reliable process. When a technology is significantly more complex, it should bring certain advantages, and I did not notice any in this example. Needless to say, training the two deterministic models took a few seconds, while the BNN needed nearly two minutes on the same machine.