



Median Tree

Comparison of two distributions

This article is about a new metric for comparing two data samples.

When I tried to compare the distributions of two different data samples, I faced a problem: none of the metrics I found in the literature was as simple, clear, and informative as I expected. However, I will not criticize any of them here, since the goal of this article is only to present my own metric.

First of all, let me say what I need. I have one sample obtained from a probabilistic model; it is relatively short, say 50 to 500 items. I also have a Monte Carlo simulation result, which can be as large as the PC allows, though a reasonable size is 4,000 to 10,000 items. I don't have any assumptions about the possible distributions, so they can be multimodal (having several peaks), exponential, practically anything, and very far from normal or any other bell-shaped distribution.

I refrain from criticism of other metrics, as promised, and note only what is wrong for my purposes with two popular ones: the Kolmogorov-Smirnov test (KST) and the Kullback-Leibler divergence (KLD).

KST compares the maximum distance between two cumulative distributions against tabulated values and gives a conclusion of the "passed/failed" type. The result depends on the sample sizes: the test may fail for a sample of 100 items but pass for a random sub-sample of half that size. When it fails, it may be only a few percent off the passing limit, and insignificant changes in the model can alter the outcome so that more tests start passing. So, it allows manipulation by driving the tests toward the wanted result.

KLD applied to data samples is a comparison of histograms. The result depends on the bins, which must be chosen in the same way for both samples; but when one sample is significantly smaller, the counts in some bins become small and unstable.

The suggested test is a Median Tree. We find the median of each of the two samples, then the two nested medians of the left and right parts into which the first median divides the sample, then the nested medians of each of the four resulting parts, and so on. The number of such division steps is called the DEPTH. The sorted medians form two vectors X and Y, whose proximity is evaluated by the relative distance R $$ R = \dfrac {2 \, || X - Y ||} {|| X || + || Y ||} $$ The theoretical limits are [0, 2], but for two random vectors R is usually near 1, so it can conventionally be assumed that the limits of the measure are [0, 1].
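For clarity, here is a minimal sketch of the construction in Python/NumPy. The function names are mine, and the convention for splitting a part around its median (two halves, with the middle element of an odd-sized part going to the right half) is an assumption, since the description above does not fix it:

```python
import numpy as np

def median_tree(sample, depth=5):
    """Sorted vector of 2**depth - 1 nested medians of a sample."""
    medians = []

    def descend(part, level):
        if level == 0 or part.size == 0:
            return
        medians.append(np.median(part))  # median of this part
        half = part.size // 2
        descend(part[:half], level - 1)  # left part, below the median
        descend(part[half:], level - 1)  # right part, above the median

    descend(np.sort(np.asarray(sample, dtype=float)), depth)
    return np.sort(medians)

def relative_distance(x, y):
    """R = 2 * ||X - Y|| / (||X|| + ||Y||); near 0 for similar samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 2.0 * np.linalg.norm(x - y) / (np.linalg.norm(x) + np.linalg.norm(y))
```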

Numerical examples

For this DEMO I chose a bimodal distribution generated by two dice sets and a probabilistic switch. Say, for example, one set has 5 dice and the other has 9 dice. If we assign the probability of choosing the first set as 0.4, with the second set chosen as the opposite event, the distribution density will look like the picture below. After a dice set is chosen, it is rolled, and the sum of the outcome values becomes an item in the data sample.

[Figure: probability density of the bimodal dice-sum distribution (sets of 5 and 9 dice, switch probability 0.4)]
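A possible way to generate such a sample (the function dice_sum_sample is mine, and six-sided dice are assumed):

```python
import numpy as np

rng = np.random.default_rng()

def dice_sum_sample(n, p_first, dice1, dice2):
    """Draw n items: choose set 1 with probability p_first (else set 2),
    roll that many six-sided dice, and record the sum."""
    counts = np.where(rng.random(n) < p_first, dice1, dice2)
    return np.array([rng.integers(1, 7, size=c).sum() for c in counts])

# the example above: sets of 5 and 9 dice, first set chosen with p = 0.4
sample = dice_sum_sample(4000, 0.4, 5, 9)
```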
To get an intuitive understanding of this measure, I provide a quick comparison of several samples. They are taken from two different populations, which we name A and B. Both are pairs of dice sets with a probabilistic switch, and their parameters are shown in the table below.

Name   Probability of choosing first set   Dice in set 1   Dice in set 2
A      0.4                                 5               9
B      0.6                                 6               8


The median trees for samples of different sizes are shown in the next two tables for DEPTH = 5, which gives 2^5 - 1 = 31 medians per sample.

A4000 (sample A, 4000 items): 12, 14, 14, 15, 16, 17, 18, 19, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 29, 30, 31, 32, 33, 33, 34, 35, 36, 37, 38, 40
A200 (sample A, 200 items): 13, 14, 15, 15, 17, 18, 18, 19, 20, 20, 22, 23, 23, 25, 26, 27, 28, 29, 30, 30, 31, 32, 32, 32, 33, 34, 34, 35, 37, 39, 40

B4000 (sample B, 4000 items): 14, 16, 17, 17, 18, 18, 19, 20, 20, 21, 21, 22, 22, 22, 23, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 29, 30, 31, 32, 33, 35
B200 (sample B, 200 items): 14, 16, 16, 16, 17, 18, 19, 19, 19, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 25, 26, 27, 28, 28, 29, 29, 31, 32, 33, 35


Here are the relative distances:

R(A4000, A200) = 0.028
R(B4000, B200) = 0.029
R(A4000, B200) = 0.134
R(B4000, A200) = 0.135
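Putting the two sketches above together, this comparison can be reproduced roughly as follows (the exact values will differ, since the samples are random):

```python
a4000 = dice_sum_sample(4000, 0.4, 5, 9)
a200  = dice_sum_sample(200,  0.4, 5, 9)
b4000 = dice_sum_sample(4000, 0.6, 6, 8)
b200  = dice_sum_sample(200,  0.6, 6, 8)

for name, x, y in [("A4000, A200", a4000, a200),
                   ("B4000, B200", b4000, b200),
                   ("A4000, B200", a4000, b200),
                   ("B4000, A200", b4000, a200)]:
    r = relative_distance(median_tree(x), median_tree(y))
    print(f"R({name}) = {r:.3f}")
```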

The results clearly show the difference when samples are taken from different populations, independently of the sample sizes, even though the distributions of these two populations are not dramatically different. The results are also very stable: if another random data set is generated with the same parameters, the relative distances may not even change in the first three decimal digits shown above.