Median Tree
Comparison of two distributions
This article is about a new metric for comparison of two samples.
When I tried to compare two distributions for two different data samples, I faced the problem. None of the found in literature metrics
was as simple, clear and informative as I expected. However, I will not criticize neither of them here, since the goal of this article
is only to present my metric.
First of all, I say what I need. I have one sample, obtained from probabilistic model, it is relatively short, let say 50 to 500 items
and I have Monte Carlo simulation result, which can be as large as PC allows, but the reasonable size is 4,000 to 10,000. I don't have any
assumptions about possible distributions, so they can be multimodal (having several peaks), exponential, practically anything and very
far from normal or other bell shaped.
I refrain from criticism of other metrics, as promised, but note only what is wrong with two popular metrics Kolmogorov-Smirnov Test (KST) and
Kullback-Leibler Divergence (KLD).
KST is comparison of maximum distance between two cumulative distributions with table values with conclusion of "passed/failed" type. It depends on the
sizes of samples. It may fail for the sample of 100 items but pass for random sub-sample from it of the half size. When failed it may
be only few percent off the passing limit, and by insignificant changes in the model I can alter the result and more tests start passing.
So, it allows manipulation by driving the tests into wanted result.
KLD for the data samples is a comparison of histograms. The result depends on the bins, which must be chosen for both samples in the same
way, but when one sample is significantly smaller, the quantities in some bins becoming small and unstable.
The suggested test is a Median Tree. We find medians for each of two samples, then two pairs of nested medians inside divided left and right parts.
Then nested medians for each of four divided parts and so on. The number of such divisive steps is called DEPTH. Sorted medians form two
vectors X, Y, which proximity is evaluated by relative distance R
$$ R = \dfrac {2 || X - Y ||} {|| X || + || Y ||} $$
Theoretical limits are [0,2], but for two random vectors they are usually near 1, so it can conventionally be assumed that the measure limits are [0,1].
Numerical examples
For this DEMO I chose bimodal distribution generated by two dice sets and probabilistic switch. Let say, for example, one set has 5 dice and another set has 9 dice.
If we assign the probability of choosing first set as 0.4 and the other as opposite event, the distribution density will look like in the picture
below. After dice set is chosen, it is rolled and the sum of outcome values becomes the item in data sample.
For getting the intuitive understanding of this measure I provide quick comparison of several samples. They are taken from two different populations,
we name them A and B. They both are two sets of dice with probabilistic switch and their parameters are shown
in the table below.
Name | Probability of choosing first set | Quantity 1 | Quantity 2 |
A | 0.4 | 5 | 9 |
B | 0.6 | 6 | 8 |
The median trees for samples of different sizes are shown in the next two tables for depth = 5.
A4000 | 12, 14, 14, 15, 16, 17, 18, 19, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 29, 30, 31, 32, 33, 33, 34, 35, 36, 37, 38, 40 |
A200 | 13, 14, 15, 15, 17, 18, 18, 19, 20, 20, 22, 23, 23, 25, 26, 27, 28, 29, 30, 30, 31, 32, 32, 32, 33, 34, 34, 35, 37, 39, 40 |
B4000 | 14, 16, 17, 17, 18, 18, 19, 20, 20, 21, 21, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 29, 30, 31, 32, 33, 35 |
B200 | 14, 16, 16, 16, 17, 18, 19, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 27, 28, 28, 29, 29, 31, 32, 33, 35 |
Here are relative distances:
R(A4000, A200) = 0.028
R(B4000, B200) = 0.029
R(A4000, B200) = 0.134
R(B4000, A200) = 0.135
The results clearly show the differences when samples are taken from different populations independently of the sizes, although the distributions for these different populations are not
significantly different. The results are very stable. If to generate another random data with the same parameters, the relative distances may not even show the difference for the
the first three decimal digits shown above.
| |