Stochastic models for regression and classification

Trivial data

While very talented medical doctors perform heart transplantations, the majority of people do not need them, because they come in with hernias, appendicitis, broken fingers, gallstones and other simple reasons for seeing a doctor. The same applies to data set complexity. Most real-life data are trivial. They contain errors that make an exact model impossible, but it is hard to tell, even for an expert in the field, whether the uncertainty is epistemic or aleatoric. The errors do not depend on the inputs, and their probability densities are all approximately bell-shaped. Applying sophisticated algorithms to such data is like asking a heart transplantation expert to see a man who hit his finger with a hammer.

Very simple, so-called surrogate models work perfectly well with trivial data, and an attempt to apply something more complex usually leads to a much worse result.
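
As a rough illustration (a minimal sketch with synthetic data, not taken from this site), the snippet below generates trivial data with input-independent, roughly bell-shaped noise and fits it with the simplest possible surrogate, an ordinary least-squares line. The simple fit already recovers both the trend and the noise level, and there is nothing more to recover.

```python
# A minimal sketch of "trivial" data: the noise does not depend on the input
# and is roughly bell-shaped, so a plain least-squares fit captures
# everything a model can capture. Synthetic data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Smooth trend plus homoscedastic Gaussian noise.
x = rng.uniform(0.0, 10.0, size=500)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=500)

# The simplest surrogate: ordinary least squares on a linear basis.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

print(f"fitted model: y = {slope:.3f} * x + {intercept:.3f}")
print(f"residual std: {residuals.std():.3f} (close to the true noise level 0.5)")
```

A histogram of the residuals would show a single bell shape regardless of the input, which is exactly what makes such data trivial.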

Non-trivial data

Data that justify the application of the elaborate methods published in highly rated research journals must have a few distinctive properties:
  • The errors depend on inputs.
  • The distributions of errors for at least some inputs are multimodal.
  • There is a gradual change in error distributions with a gradual change of inputs.
That simply means that two remote inputs $X^k$ and $X^m$ may have arbitrary probability density functions $f_k(y)$, $f_m(y)$ (like in the images below),
but for any two close inputs, $||X^i - X^j|| < \varepsilon$, we expect the corresponding densities to be close as well, $||f_i(y) - f_j(y)|| < \delta$ (where $||\cdot||$ is a selected metric).
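
For a concrete picture of such data (a hedged synthetic example of our own, not the method described on this site), the snippet below samples targets from a two-component Gaussian mixture whose weight and mode locations vary smoothly with the input: the error depends on the input, the density is multimodal, and close inputs produce close densities while remote inputs may look completely different.

```python
# Synthetic "non-trivial" data: the target density is bimodal, depends on the
# input, and changes gradually as the input changes. Illustration only.
import numpy as np

rng = np.random.default_rng(1)

def sample_target(x, n=1):
    """Sample n targets for a scalar input x from a two-mode mixture."""
    w = 0.5 + 0.3 * np.sin(x)                 # mixing weight varies smoothly with x
    upper = rng.random(n) < w                 # pick a mode for each sample
    mode = np.where(upper, x + 1.0, x - 1.0)  # mode locations also depend on x
    return mode + rng.normal(0.0, 0.2, size=n)

# Close inputs (1.00 and 1.05) give close empirical densities,
# while a remote input (6.00) gives a clearly different one.
for x in (1.0, 1.05, 6.0):
    ys = sample_target(x, n=2000)
    q10, q50, q90 = np.percentile(ys, [10, 50, 90])
    print(f"x = {x:.2f}  10/50/90% quantiles: {q10:6.2f} {q50:6.2f} {q90:6.2f}")
```

Plotting histograms of the sampled targets for the three inputs would show two bumps each, with the first two nearly identical and the third shifted and reweighted.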

Obviously, verifying the above conditions for every given data set is simply not possible, but since the natural phenomenon behind the modelled data is usually known, it is possible to judge whether these conditions apply.

This site offers a new method for stochastic modelling that is theoretically capable of identifying multimodality in the distributions of targets; we call it Divisive Data Resorting.