Divisive data resorting (DDR)

Reference to archived article DDR.

DDR is a quick and easy algorithm of building stochastic models, capable to recognize multimodal distribution of input dependent targets. It is suggested as alternative to Bayesian neural networks.

The training data should be set of independent records $[X^i, y_i]$, where $X^i$ are vectors of observed features and $y_i$ are target scalars. Mathematically, feature vectors are points in multidimmensional real space $X^i \in \mathbb{R}^m$. For an explanation purpose we can imagine multiple non intersecting hyperspheres sitting on the short distances from each other. Assume each of these hyperspheres contains same number of targets shown in the picture as points projected into parallel lines in considered hyperspace.
Now we can introduce non intersecting layers, shown by red lines, where each layer covers or lays near same number of points equal to the number of hyperspheres. The idea of training algorithm is to cover entire input space by non intersecting hypersheres with the same number of targets, introduce non intersecting layers with equal number of target values and build individual models for each layer.

If target points within each hyperspere are samples from input dependent distributions and these distributions are gradually changing, forming certain pattern, the ensemble of models built in this way can be used for prediction of distributions of targets for the inputs not used in the training process.

The actual training algorithm does not divide input space into adjacent hyperspheres with the same number of points inside them. The concept with hypersheres was used only for explanation. The actual ensemble of models with layered data subsets is obtained by few elementary steps.

On first step we build single model $M_{1,1}$ by minimizing residual errors. Then we sort records according to residual errors $e_i = M_{1,1}(X^i) - y_i$, divide sorted list into two even subsets and build for each subset two new models $M_{2,1}$ and $M_{2,2}$. Then we apply same dividing and sorting steps for each of these subsets and obtain $M_{3,1}$, $M_{3,2}$, $M_{3,3}$, $M_{3,4}$. The residual errors significantly decline in each division step because selected records have smaller error ranges and data set sizes sharply decline. Usually, third divisive step is the last even for very approximate data.

Generalized DDR

It has been found exprimentally that number of divided blocks can be chosen arbitrarily. For example, we apply first sorting to entire data set, then divide it into two segments as explained above, build two models and resort data within each model, but then divide all sorted data into three segments, then into five and so on. Also models $M_{k,j}$ may not necessarily be applied to disjoints, the data segments may partially overlap. This generalized DDR sometimes shows insignificant advantages. It allows building larger ensembles.

Numerical simulation

Numerical example, coding sample and comparison to Bayesian Neural Network is shown in the next page.