Content

Divisive data resorting (DDR)
Reference to archived article DDR.

DDR is a quick and easy algorithm of building stochastic models, capable to recognize multimodal distribution
of input dependent targets. It is suggested as alternative to Bayesian neural networks.
The training data should be set of independent records $[X^i, y_i]$, where $X^i$ are vectors of observed features and
$y_i$ are target scalars. Mathematically, feature vectors are points in multidimmensional
real space $X^i \in \mathbb{R}^m$. For an explanation purpose we can imagine multiple non intersecting hyperspheres
sitting on the short distances from each other. Assume each of these hyperspheres contains same number of targets
shown in the picture as points projected into parallel lines in considered hyperspace.
Now we can introduce non intersecting layers, shown by red lines, where each layer covers or lays near same number of points equal to the number of
hyperspheres. The idea of training algorithm is to cover entire input space by non intersecting hypersheres with the same number of
targets, introduce non intersecting layers with equal number of target values and build individual models
for each layer.
If target points within each hyperspere are samples from input dependent distributions and these distributions
are gradually changing, forming certain pattern, the ensemble of models built in this way can
be used for prediction of distributions of targets for the inputs not used in the training process.
The actual training algorithm does not divide input space into adjacent hyperspheres with the same number of points inside them.
The concept with hypersheres was used only for explanation. The actual ensemble of models with layered data subsets is obtained by few
elementary steps.
On first step we build single model $M_{1,1}$ by minimizing residual errors. Then we sort records according to
residual errors $e_i = M_{1,1}(X^i)  y_i$, divide sorted list into two even subsets and build for each subset
two new models $M_{2,1}$ and $M_{2,2}$. Then we apply same dividing and sorting steps for each of these subsets
and obtain $M_{3,1}$, $M_{3,2}$, $M_{3,3}$, $M_{3,4}$. The residual errors significantly decline in each
division step because selected records have smaller error ranges and data set sizes sharply decline. Usually,
third divisive step is the last even for very approximate data.
Generalized DDR
It has been found exprimentally that number of divided blocks can be chosen arbitrarily. For example, we apply first
sorting to entire data set, then divide it into two segments as explained above, build two models and resort data
within each model, but then divide all sorted data into three segments, then into five and so on. Also models
$M_{k,j}$ may not necessarily be applied to disjoints, the data segments may partially overlap. This generalized
DDR sometimes shows insignificant advantages. It allows building larger ensembles.
Numerical simulation
Numerical example, coding sample and comparison to Bayesian Neural Network is shown in the next page.
 
