In IR and many other research domains, we routinely use statistical tests to evaluate whether a newly proposed model brings a significant improvement over baselines. I will not judge here whether this is a good practice; I just introduce how to conduct a statistical test using R, Python, etc.

————————————————–

http://www.r-tutor.com/elementary-statistics/non-parametric-methods/wilcoxon-signed-rank-test

## Wilcoxon Signed-Rank Test

Two data samples are matched if they come from repeated observations of the same subject. Using the Wilcoxon Signed-Rank Test, we can decide whether the corresponding data population distributions are identical without assuming them to follow the normal distribution.
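To make the idea concrete, here is a minimal Python sketch of how the signed-rank sums behind the test can be computed by hand. The helper name `signed_rank_sums` and the sample values are made up for illustration; only `numpy` and `scipy.stats.rankdata` are real library calls.

```python
import numpy as np
from scipy.stats import rankdata


def signed_rank_sums(x, y):
    """Return (W_plus, W_minus): rank sums of the positive and negative paired differences."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    d = d[d != 0]                # zero differences carry no sign and are dropped
    ranks = rankdata(np.abs(d))  # tied absolute differences get the average rank
    w_plus = ranks[d > 0].sum()
    w_minus = ranks[d < 0].sum()
    return w_plus, w_minus


# Hypothetical matched samples (not taken from any real data set)
before = [10.0, 12.0, 9.0, 11.0, 14.0]
after = [8.0, 13.0, 6.0, 7.0, 9.0]

w_plus, w_minus = signed_rank_sums(before, after)
print(w_plus, w_minus)  # 14.0 1.0
```

The statistic V reported by R's wilcox.test for paired samples is the sum of the positive ranks, which corresponds to `w_plus` in this sketch.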

**Example**

In the built-in data set named immer, the barley yields of the same fields in 1931 and 1932 are recorded. The yield data are presented in the data frame columns Y1 and Y2.

> library(MASS) # load the MASS package

> head(immer)

Loc Var Y1 Y2

1 UF M 81.0 80.7

2 UF S 105.4 82.3

…..

**Problem**

Without assuming the data to have normal distribution, test at .05 significance level if the barley yields of 1931 and 1932 in data set immer have identical data distributions.

**Solution**

The null hypothesis is that the barley yields of the two sample years come from identical populations. To test it, we apply the wilcox.test function to the matched samples, setting the “paired” argument to TRUE. As the p-value turns out to be 0.005318, which is less than the .05 significance level, we reject the null hypothesis.

> wilcox.test(immer$Y1, immer$Y2, paired=TRUE)

Wilcoxon signed rank test with continuity correction

data: immer$Y1 and immer$Y2

V = 368.5, p-value = 0.005318

alternative hypothesis: true location shift is not equal to 0

Warning message:

In wilcox.test.default(immer$Y1, immer$Y2, paired = TRUE) :

cannot compute exact p-value with ties

**Answer**

At .05 significance level, we conclude that the barley yields of 1931 and 1932 from the data set immer are nonidentical populations.

————————————-

## Wilcoxon Signed-Rank Test with Python

The Wilcoxon signed-rank test is even easier to run in Python. Given the paired differences in diffs, the following lines suffice:

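import scipy.stats as stat

# diffs: sequence of paired differences (e.g., per-topic score of the new model minus the baseline)
wvalue = stat.wilcoxon(diffs)

print("wilcoxon value:", wvalue)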
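For a self-contained version, the sketch below builds `diffs` from per-topic scores of a baseline and a proposed model and applies the .05 threshold used above. The score arrays are invented purely for illustration, and the variable names (`baseline`, `proposed`, `alpha`) are assumptions rather than part of any particular evaluation tool.

```python
import numpy as np
import scipy.stats as stat

# Hypothetical per-topic scores (e.g., average precision); not real results
baseline = np.array([0.21, 0.35, 0.10, 0.48, 0.27, 0.33, 0.19, 0.40, 0.25, 0.30])
proposed = np.array([0.26, 0.37, 0.17, 0.49, 0.31, 0.42, 0.22, 0.46, 0.35, 0.38])

# Paired differences, one per topic
diffs = proposed - baseline

# Two-sided Wilcoxon signed-rank test on the differences
result = stat.wilcoxon(diffs)

# Passing both samples is equivalent: the differences are computed internally
result_xy = stat.wilcoxon(proposed, baseline)

alpha = 0.05
significant = result.pvalue < alpha
print("statistic:", result.statistic, "p-value:", result.pvalue)
print("significant at", alpha, ":", significant)
```

Note that for the two-sided test, `scipy.stats.wilcoxon` reports the smaller of the two signed-rank sums as its statistic, so the value can differ from the V printed by R even when the p-values agree.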

A tool that runs the Wilcoxon signed-rank test for TREC evaluation is provided at the following link: