Stochastic models for regression and classification

Trivial data

While highly talented surgeons perform heart transplants, most people never need one: they see a doctor for hernias, appendicitis, broken fingers, gallstones and other simple complaints. The same is true of data sets. Most real-life data are trivial. They contain errors that make an exact model impossible, yet even a domain expert may struggle to say whether the uncertainty is epistemic or aleatoric. The errors do not depend on the inputs, and their probability densities are all roughly bell-shaped. Applying a sophisticated algorithm to such data is like sending a man who hit his finger with a hammer to a heart-transplant specialist.

Very simple, so-called surrogate models handle trivial data perfectly well, and attempting anything more complex tends to give a much worse result.
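As a minimal sketch of this point, the snippet below fits an ordinary least-squares line (a typical simple surrogate) to synthetic trivial data: a linear signal plus input-independent, bell-shaped noise. The generator and its parameters are illustrative assumptions, not data from any real study.

```python
import numpy as np

# Trivial data: a linear signal with input-independent,
# roughly bell-shaped (Gaussian) noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 500)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# A very simple surrogate -- ordinary least squares -- recovers
# the underlying signal almost exactly.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With 500 points and modest noise, the fitted slope and intercept land very close to the true values (2 and 1); nothing more elaborate is needed.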

Non-trivial data

Data that justify the sophisticated methods published in highly rated research journals must have a few distinctive properties:
  • The errors depend on inputs.
  • The distributions of errors for at least some inputs are multimodal.
  • There is a gradual change in error distributions with gradual change of inputs.
That simply means that two remote inputs $X^k$ and $X^m$ may have arbitrary probability density functions $f_k(y)$, $f_m(y)$ (like in the images below),
but for any two close inputs, $||X^i - X^j|| < \varepsilon$, we expect two close corresponding densities, $||f_i(y) - f_j(y)|| < \delta$ (where $||\cdot||$ is a chosen metric).
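These three properties can be illustrated with a hypothetical generator in which the target is drawn from a two-component mixture whose mode separation grows with the input: the error distribution depends on the input, becomes bimodal for large inputs, and changes gradually. The generator, the metric (an L1 histogram distance), and all numbers below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_y(x, n=1000):
    # Target drawn from a two-mode mixture centred at -x and +x:
    # near x = 0 the modes merge into a single bump, for larger x
    # the conditional density is clearly bimodal.
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * x + rng.normal(scale=0.3, size=n)

bins = np.linspace(-4.0, 4.0, 41)

def density(y):
    # Empirical density estimate on a fixed grid.
    h, _ = np.histogram(y, bins=bins, density=True)
    return h

# Two close inputs give close empirical densities...
y_a, y_b = sample_y(2.0), sample_y(2.05)
# ...while a remote input gives a visibly different one.
y_far = sample_y(0.0)

d_close = np.abs(density(y_a) - density(y_b)).sum()
d_far = np.abs(density(y_a) - density(y_far)).sum()
print(d_close < d_far)
```

The comparison prints `True`: the densities for the two nearby inputs are far closer to each other than either is to the density at the remote input, matching the $\varepsilon$–$\delta$ condition above.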

Obviously, verifying the above conditions for every given data set is simply not possible; but since the natural phenomenon behind the modelled data is known, it is usually possible to judge whether these conditions are likely to hold.
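When data are abundant, a rough empirical sanity check is still possible. The sketch below is a simple heuristic of our own choosing, not part of the method described on this site: it counts modes in a smoothed histogram of the targets, so that pooled target samples from unimodal and bimodal sources can be told apart. The bin count, smoothing window and noise floor are arbitrary illustrative choices.

```python
import numpy as np

def count_modes(y, bins=20, smooth=5):
    # Rough mode count: smooth the histogram with a moving average,
    # then count strict local maxima above a small noise floor.
    h, _ = np.histogram(y, bins=bins, density=True)
    kernel = np.ones(smooth) / smooth
    h = np.convolve(h, kernel, mode="same")
    floor = 0.05 * h.max()
    peaks = [i for i in range(1, len(h) - 1)
             if h[i - 1] < h[i] > h[i + 1] and h[i] > floor]
    return len(peaks)

rng = np.random.default_rng(2)
unimodal = rng.normal(size=20_000)
bimodal = np.concatenate([rng.normal(-3.0, 0.5, 10_000),
                          rng.normal(3.0, 0.5, 10_000)])
print(count_modes(unimodal), count_modes(bimodal))
```

Such a check only flags multimodality in the pooled targets; detecting input-dependent multimodality would require repeating it within neighbourhoods of the input space, which quickly becomes impractical, hence the reliance on domain knowledge noted above.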

This site offers a new method for stochastic modelling, which we call Divisive Data Resorting; it is theoretically capable of identifying multimodality in the distributions of targets.