Divisive data resorting (DDR)

Reference: the archived article on DDR.



DDR is a quick and easy algorithm for building stochastic models, capable of recognizing multimodal distributions of input-dependent targets. It is suggested as an alternative to Bayesian neural networks.

The training data should be a set of independent records $[X^i, y_i]$, where $X^i$ are vectors of observed features and $y_i$ are scalar targets. Mathematically, the feature vectors are points in a multidimensional real space, $X^i \in \mathbb{R}^m$. For explanation purposes, imagine multiple non-intersecting hyperspheres sitting at short distances from each other. Assume each of these hyperspheres contains the same number of targets, shown in the picture as points projected onto parallel lines in the considered hyperspace.
Now we can introduce non-intersecting layers, shown by red lines, where each layer covers or lies near the same number of points, equal to the number of hyperspheres. The idea of the training algorithm is to cover the entire input space by non-intersecting hyperspheres with the same number of targets, introduce non-intersecting layers with equal numbers of target values, and build an individual model for each layer.

If the target points within each hypersphere are samples from input-dependent distributions, and these distributions change gradually, forming a certain pattern, then the ensemble of models built in this way can be used to predict the distributions of targets for inputs not used in the training process.

The actual training algorithm does not divide the input space into adjacent hyperspheres with the same number of points inside them; the concept with hyperspheres was used only for explanation. The actual ensemble of models with layered data subsets is obtained in a few elementary steps.

In the first step we build a single model $M_{1,1}$ by minimizing residual errors. Then we sort the records according to the residual errors $e_i = M_{1,1}(X^i) - y_i$, divide the sorted list into two equal subsets, and build two new models $M_{2,1}$ and $M_{2,2}$, one for each subset. Then we apply the same sorting and dividing steps to each of these subsets and obtain $M_{3,1}$, $M_{3,2}$, $M_{3,3}$, $M_{3,4}$. The residual errors decline significantly in each division step, because the selected records have smaller error ranges and the data set sizes shrink sharply. Usually, the third divisive step is the last, even for very noisy data.
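A minimal sketch of this recursive sort-and-split procedure is shown below. It is not the published code: the function names `fit_model` and `ddr_split` are illustrative, and ordinary least squares is used here only as a stand-in for the base learner (the published project uses Kolmogorov-Arnold / Urysohn models).

```python
import numpy as np

def fit_model(X, y):
    # Stand-in base learner: ordinary least squares with a bias term.
    A = np.c_[X, np.ones(len(X))]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.c_[Xn, np.ones(len(Xn))] @ coef

def ddr_split(X, y, levels=3):
    # Level 1: a single model built on the whole data set.
    layers = [(fit_model(X, y), X, y)]
    for _ in range(levels - 1):
        next_layers = []
        for model, Xs, ys in layers:
            order = np.argsort(model(Xs) - ys)        # sort records by residual error
            half = len(order) // 2
            for idx in (order[:half], order[half:]):  # two equal subsets
                next_layers.append((fit_model(Xs[idx], ys[idx]), Xs[idx], ys[idx]))
        layers = next_layers
    return [m for m, _, _ in layers]                  # final-level models form the ensemble
```

In this simplified sketch, evaluating every final-level model on a new input yields a set of values that can be treated as a sample from the predicted target distribution, from which expectations and variances are estimated.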

Generalized DDR

It has been found experimentally that the number of divided blocks can be chosen arbitrarily. For example, we apply the first sorting to the entire data set, then divide it into two segments as explained above, build two models and resort the data within each model, but then divide all sorted data into three segments, then into five, and so on. Also, the models $M_{k,j}$ need not be applied to disjoint subsets; the data segments may partially overlap. This generalized DDR sometimes shows minor advantages, and it allows building larger ensembles. A small sketch of such a splitting helper follows.
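The helper below is hypothetical, not taken from the published code. It splits a sorted record list into an arbitrary number of segments of nearly equal size and optionally lets neighbouring segments overlap; it could replace the fixed halving of the previous sketch.

```python
import numpy as np

def split_indices(n, blocks, overlap=0.0):
    # Split n sorted records into `blocks` segments of (nearly) equal size;
    # `overlap` extends each segment into its neighbours by a fraction of a
    # segment length, so adjacent data subsets may partially overlap.
    edges = np.linspace(0, n, blocks + 1).astype(int)
    pad = int(overlap * n / blocks)
    return [np.arange(max(edges[k] - pad, 0), min(edges[k + 1] + pad, n))
            for k in range(blocks)]

# Example: three overlapping segments out of 10 000 sorted records.
print([len(s) for s in split_indices(10000, 3, overlap=0.1)])
```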

Numerical simulation

Training and validation sets are generated by Mike's benchmark formula


where $C_j$ are uniformly distributed random values on $[0,1]$, the error level is $\delta = 0.8$, and $[X_j, y]$ are the observed features and targets. The training data size was 10 000 records; the validation sample consisted of 100 new tuples. The accuracy metric was the Pearson correlation coefficient.
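For reference, the percent figures reported in the tables below could be reproduced with a computation like the following, where `predicted` and `actual` are hypothetical stand-ins for the 100 predicted and "actual" expectations (or variances) of the validation inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
actual = rng.uniform(0.0, 1.0, size=100)              # hypothetical "actual" values
predicted = actual + rng.normal(0.0, 0.05, size=100)  # hypothetical model outputs

# Pearson correlation coefficient, expressed in percent as in the tables.
pearson_percent = 100.0 * np.corrcoef(predicted, actual)[0, 1]
print(f"{pearson_percent:.0f}")
```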

The DDR result was compared to the Bayesian neural network example published on the Keras site, with the Keras data set replaced by Mike's formula. The changes are elementary and can be made by anyone who wishes to repeat the experiment, but the modified code is nonetheless provided for convenience.

Besides plain correlation, the modelled samples were compared to Monte Carlo simulations by the Kolmogorov-Smirnov test. This test verifies whether two samples are drawn from the same population. Since the BNN code generates only expectations and variances rather than samples, this Kolmogorov-Smirnov goodness-of-fit test was applied only to the DDR-predicted distributions. All results are shown in the tables. The inputs generated by the formula were passed to the DDR ensemble and to the trained BNN model; both returned expectations and variances. The so-called actual expectations and variances were computed by applying the formula to the same input with different disturbances $C_j$. The tables show Pearson correlation coefficients, expressed in percent, for the 100 validation inputs.
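As an illustration of how such a check could be coded, the snippet below applies the two-sample Kolmogorov-Smirnov test from SciPy to two stand-in samples. The 5% significance level is an assumption; the article does not state the threshold used.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical stand-ins: in the experiment these would be the DDR-predicted
# sample and the Monte Carlo ("actual") sample for one validation input.
ddr_sample = rng.normal(loc=1.0, scale=0.8, size=200)
mc_sample = rng.normal(loc=1.0, scale=0.8, size=200)

statistic, p_value = ks_2samp(ddr_sample, mc_sample)
passed = p_value > 0.05   # assumed significance level
print(f"KS statistic {statistic:.3f}, p-value {p_value:.3f}, passed: {passed}")
```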

DDR tests with Kolmogorov-Arnold ensemble, execution time 35 sec

Expectations   Variances   KS tests passed (out of 100)
99             96          90
99             95          86
99             96          88


BNN tests, execution time 500 sec

Expectations   Variances
98             92
99             96
98             95


Conclusion

While the accuracy is about the same, the execution time for BNN is roughly 14 times longer. In addition, BNN does not return actual samples, only expectations and variances, so its predicted distributions could not be compared to the so-called actual (Monte Carlo simulated) ones. The different rows of the tables hold results for different executions of the programs. Working with the C# code is much easier than with Python libraries: all computational routines are in the code itself and not hidden deep inside Python libraries. The entire DDR logic plus Kolmogorov-Arnold modelling, including the internal identification of multiple Urysohn operators arranged in a tree, takes about 700 lines of C# code using only default built-in types. The project is published.