Content

Divisive data resorting (DDR)
Reference to archived article DDR.

DDR is a quick and easy algorithm of building stochastic models, capable to recognize multimodal distribution
of input dependent targets. It is suggested as alternative to Bayesian neural networks.
The training data should be set of independent records $[X^i, y_i]$, where $X^i$ are vectors of observed features and
$y_i$ are target scalars. Mathematically, feature vectors are points in multidimmensional
real space $X^i \in \mathbb{R}^m$. For an explanation purpose we can imagine multiple non intersecting hyperspheres
sitting on the short distances from each other. Assume each of these hyperspheres contains same number of targets
shown in the picture as points projected into parallel lines in considered hyperspace.
Now we can introduce non intersecting layers, shown by red lines, where each layer covers or lays near same number of points equal to the number of
hyperspheres. The idea of training algorithm is to cover entire input space by non intersecting hypersheres with the same number of
targets, introduce non intersecting layers with equal number of target values and build individual models
for each layer.
If target points within each hyperspere are samples from input dependent distributions and these distributions
are gradually changing, forming certain pattern, the ensemble of models built in this way can
be used for prediction of distributions of targets for the inputs not used in the training process.
The actual training algorithm does not divide input space into adjacent hyperspheres with the same number of points inside them.
The concept with hypersheres was used only for explanation. The actual ensemble of models with layered data subsets is obtained by few
elementary steps.
On first step we build single model $M_{1,1}$ by minimizing residual errors. Then we sort records according to
residual errors $e_i = M_{1,1}(X^i)  y_i$, divide sorted list into two even subsets and build for each subset
two new models $M_{2,1}$ and $M_{2,2}$. Then we apply same dividing and sorting steps for each of these subsets
and obtain $M_{3,1}$, $M_{3,2}$, $M_{3,3}$, $M_{3,4}$. The residual errors significantly decline in each
division step because selected records have smaller error ranges and data set sizes sharply decline. Usually,
third divisive step is the last even for very approximate data.
Generalized DDR
It has been found exprimentally that number of divided blocks can be chosen arbitrarily. For example, we apply first
sorting to entire data set, then divide it into two segments as explained above, build two models and resort data
within each model, but then divide all sorted data into three segments, then into five and so on. Also models
$M_{k,j}$ may not necessarily be applied to disjoints, the data segments may partially overlap. This generalized
DDR sometimes shows insignificant advantages. It allows building larger ensembles.
Numerical simulation
Training and validation sets are generated by Mike's benchmark formula
where $C_j$ are uniformly distributed random values on $[0,1]$, error level $\delta = 0.8$ and $[X_j, y]$ are
observed features and targets. The training data size was 10 000, the validation sample was 100 new tuples.
The accuracy metrics were Pearson correlation coefficients.
The DDR result was compared to
Bayesian neural network
example published on Keras. Keras data set was replaced by Mike's formula.
The changes are elementary and can be applied by those who wish to repeate experiment,
but modified code nontheless is provided for convenience.
Besides plain correlation, the modelled samples were compared to Monte Carlo simulations by KolmogorovSmirnov test.
This test verifies if two samples are drawn from the same population. Since BNN code did not generate samples but
only expectations and variances this KolmogorovSmirnov goodnessoffit test was applied only to DDR predicted distributions.
All results are in the tables. The inputs generated by formula were passed to DDR ensemble and BNN trained
model. Both of them returned expectations and variances. So called actual results for expectations and variances
were computed by applying formula for this input with different disturbances $C_j$. The tables show Pearson correlation coefficients
expressed in percents for 100 validation inputs.
Expectations  Variances  KS passed tests from 100 
DDR tests with KolmogorovArnold ensemble, execution time 35 sec
99  96  90 
99  95  86 
99  96  88 
Expectations  Variances 
BNN tests, execution time 500 sec
98  92 
99  96 
98  95 
Conclusion
While accuracy is about the same, the execution time for BNN is 14 times longer, plus it does not return acutal samples,
only expectations and variances, so predicted distributions were not compared to socalled actual (Monte Carlo simulated).
The different rows of the tables hold results for different executions of the programs.
Working with C# code is much easiear than with Python libraries, all computational routines are in code and not deeply
hidden somewhere in Python libraries. The entire DDR logic + KolmogorovArnold modelling, including internal identification of
multiple Urysohn operators arranged in a tree takes near 700 lines of C# code using only default builtin types.
The project is published.
 
