module analysis::statistics::Inference
Statistical inference methods.
Usage
import analysis::statistics::Inference;
Description
The following functions are provided:
function chiSquare
Chi-square coefficient of data values.
num chiSquare(lrel[num expected, int observed] values)
Compute the ChiSquare statistic comparing observed and expected frequency counts.
Examples
Consider an example from the web page mentioned above. To test the hypothesis that a random sample of 100 people has been drawn from a population in which men and women are equal in frequency, the observed number of men and women would be compared to the theoretical frequencies of 50 men and 50 women. If there were 44 men in the sample and 56 women, then we have the following:
rascal>import analysis::statistics::Inference;
ok
rascal>chiSquare([<50, 44>, <50, 56>])
num: 1.44
function chiSquareTest
Chi-square test on data values.
num chiSquareTest(lrel[num expected, int observed] values)
bool chiSquareTest(lrel[num expected, int observed] values, real alpha)
Perform a chi-square test comparing expected and observed frequency counts. There are two forms of this test:
Returns the observed significance level, or p-value, associated with a Chi-square goodness of fit test comparing observed frequency counts to expected counts.
Performs a Chi-square goodness of fit test evaluating the null hypothesis that the observed counts conform to the frequency distribution described by the expected counts, with significance level
alpha
(0 <alpha
< 0.5). Returns true iff the null hypothesis can be rejected with confidence 1 -alpha
.
function tTest
T-test on sample data.
num tTest(list[num] sample1, list[num] sample2)
bool tTest(list[num] sample1, list[num] sample2, num alpha)
bool tTest(num mu, list[num] sample, num alpha)
Perform student's t-test The test is provided in three variants:
- Returns the observed significance level, or p-value, associated with a two-sample, two-tailed t-test comparing the means of the input samples. The number returned is the smallest significance level at which one can reject the null hypothesis that the two means are equal in favor of the two-sided alternative that they are different. For a one-sided test, divide the returned value by 2.
The t-statistic used is t = (m1 - m2) / sqrt(var1/n1 + var2/n2)
where
n1
is the size of the first sample
n2
is the size of the second sample;
m1
is the mean of the first sample;
m2
is the mean of the second sample;
var1
is the variance of the first sample;
var2
is the variance of the second sample.
Performs a two-sided t-test evaluating the null hypothesis that
sample1
andsample2
are drawn from populations with the same mean, with significance levelalpha
. This test does not assume that the subpopulation variances are equal. Returns true iff the null hypothesis that the means are equal can be rejected with confidence 1 -alpha
. To perform a 1-sided test, usealpha
/ 2.Performs a two-sided t-test evaluating the null hypothesis that the mean of the population from which sample is drawn equals
mu
. Returns true iff the null hypothesis can be rejected with confidence 1 -alpha
. To perform a 1-sided test, usealpha
* 2.
Examples
We use the data from the following http://web.mst.edu/~psyworld/texample.htm#1[example] to illustrate the t-test. First, we compute the t-statistic using the formula given above.
rascal>import util::Math;
ok
rascal>import analysis::statistics::Descriptive;
ok
rascal>import List;
ok
rascal>s1 = [5,7,5,3,5,3,3,9];
list[int]: [5,7,5,3,5,3,3,9]
rascal>s2 = [8,1,4,6,6,4,1,2];
list[int]: [8,1,4,6,6,4,1,2]
rascal>(mean(s1) - mean(s2))/sqrt(variance(s1)/size(s1) + variance(s2)/size(s2));
real: 0.84731854581
This is the same result as obtained in the cited example.
We can also compute it directly using the tTest
functions:
rascal>import analysis::statistics::Inference;
ok
rascal>tTest(s1, s2);
num: 0.4115203997374087
Observe that this is a smaller value than comes out directly of the formula. Recall that: The number returned is the smallest significance level at which one can reject the null hypothesis that the two means are equal in favor of the two-sided alternative that they are different. Finally, we perform the test around the significance level we just obtained:
rascal>tTest(s1,s2,0.40);
bool: false
rascal>tTest(s1,s2,0.50);
bool: true
function anovaFValue
Analysis of Variance (ANOVA) f-value.
num anovaFValue(list[list[num]] categoryData)
Perform Analysis of Variance test also described here
Compute the F statistic -- also known as F-test -- using the definitional formula
F = msbg/mswg
where
msbg
= between group mean square.mswg
= within group mean square.
are as defined here
function anovaPValue
Analysis of Variance (ANOVA) p-value.
num anovaPValue(list[list[num]] categoryData)
Perform Analysis of Variance test also described here
Computes the exact p-value using the formula p = 1 - cumulativeProbability(F)
where F
is the Anova F Value.
function anovaTest
Analysis of Variance (ANOVA) test.
bool anovaTest(list[list[num]] categoryData, num alpha)
Perform Analysis of Variance also described here
Returns true iff the estimated p-value is less than alpha
(0 < alpha
<= 0.5).
The exact p-value is computed using the formula p = 1 - cumulativeProbability(F)
where F
is the Anova F Value.
function gini
Gini coefficient.
real gini(lrel[num observation,int frequency] values)
Computes the Gini Coefficient that measures the inequality among values in a frequency distribution.
The Gini coefficient is computed using Deaton's formula and returns a value between 0 (completely equal distribution) and 1 (completely unequal distribution).
Examples
rascal>import analysis::statistics::Inference;
ok
A completely equal distribution:
rascal>gini([<10000, 1>, <10000, 1>, <10000, 1>]);
real: 0.0
A rather unequal distribution:
rascal>gini([<998000, 1>, <20000, 3>, <117500, 1>, <70000, 2>, <23500, 5>, <45200,1>]);
real: 0.8530758129256304