Teach/Me Data Analysis

You are working with the text-only light edition of "H. Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8".

See also: survey on statistical tests, Kolmogorov-Smirnov test, tests for normality, distribution calculator

Chi-Square Test

The hypothesis tests we have used so far assume that the data are normally distributed. This assumption does not always hold, so we need a method to check whether our assumption about the distribution of the data is correct.

The easiest way to compare distributions is to compare them visually: we overlay a histogram of the data with the theoretical distribution with which it is to be compared. Of course, this approach lacks statistical justification. A sound method to compare an empirical distribution with a known (parametric) distribution is the χ² test (chi-square test).

One practical problem is that evaluating a parametric distribution function yields probabilities rather than frequencies. In order to compare the empirical and the theoretical distribution, we have to estimate the expected frequencies by multiplying the theoretical probabilities by the number of samples.

The probability that the variable x falls into a bin [a_i, a_{i+1}] is the difference of the probabilities of x being less than the bin boundaries a_{i+1} and a_i, respectively:

Prob(a_i < x < a_{i+1}) = Prob(x < a_{i+1}) - Prob(x < a_i)
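The two steps just described, bin probabilities from differences of the cumulative distribution and expected frequencies from multiplying by the number of samples, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `normal_cdf` and `expected_frequencies`, the bin boundaries and the sample size of 200 are made up for the example, and the normal distribution is used as the theoretical distribution.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """Prob(X < x) for a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def expected_frequencies(edges, n_samples, cdf=normal_cdf):
    """Expected count in each bin [edges[i], edges[i+1]] for n_samples observations."""
    expected = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        prob = cdf(hi) - cdf(lo)           # Prob(a_i < x < a_{i+1})
        expected.append(prob * n_samples)  # probability -> expected frequency
    return expected

# Hypothetical bin boundaries; the outer bins are open-ended,
# so the bin probabilities add up to 1.
edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]
expected = expected_frequencies(edges, n_samples=200)
print([round(e, 2) for e in expected])
```

Because the outer bins extend to infinity, the expected frequencies add up to the number of samples, which is a useful sanity check before comparing them with the histogram counts.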

For each bin, the squared difference between the observed (empirical) and the expected (theoretical) frequency is calculated and divided by the expected frequency. The sum of these weighted squared differences is the χ² statistic. The null hypothesis is that the two distributions are the same and that the differences are due to random errors.
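As a minimal sketch of this calculation, the statistic can be written as a sum over the bins of (observed - expected)² / expected. The function name `chi_square_statistic` and the frequencies below are hypothetical and serve only to illustrate the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Sum over all bins of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies for four bins (both lists sum to 200 samples).
observed = [18, 55, 70, 57]
expected = [25.0, 50.0, 75.0, 50.0]

chi2 = chi_square_statistic(observed, expected)
print(round(chi2, 3))  # 1.96 + 0.5 + 0.333 + 0.98 = 3.773
```

Dividing by the expected frequency weights each bin, so a given absolute difference counts more in a bin where few samples are expected than in a bin where many are expected.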

Note: The theoretical probabilities are normally tabulated for standard parameters, i.e. zero mean and unit variance in the case of the normal distribution. So we either have to standardize the histogram, or estimate the distribution parameters and use them to calculate the probabilities for the bins of the histogram. The number of estimated parameters k influences the degrees of freedom used in the χ² test.
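The following sketch shows one way the note could be put into practice for the normal distribution. It rests on several assumptions that are not part of the text above: the data are standardized with the estimated mean and standard deviation (so k = 2), the degrees of freedom follow the usual textbook rule "number of bins - 1 - k", the decision uses the commonly tabulated 5% critical values of the χ² distribution, and the sample, the bin boundaries and the name `chi_square_normality` are invented for the example.

```python
import math
import random
import statistics

# Commonly tabulated 5% critical values of the chi-square distribution,
# keyed by degrees of freedom.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}

def normal_cdf(z):
    """Prob(Z < z) for the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi_square_normality(data, z_edges):
    """Return (chi-square statistic, degrees of freedom) for a normal fit."""
    n = len(data)
    mean = statistics.mean(data)   # estimated parameter 1
    std = statistics.stdev(data)   # estimated parameter 2
    k = 2                          # number of estimated parameters
    z = [(x - mean) / std for x in data]  # standardize the data
    chi2 = 0.0
    for lo, hi in zip(z_edges[:-1], z_edges[1:]):
        observed = sum(1 for v in z if lo <= v < hi)
        expected = n * (normal_cdf(hi) - normal_cdf(lo))
        chi2 += (observed - expected) ** 2 / expected
    dof = (len(z_edges) - 1) - 1 - k  # assumed rule: bins - 1 - k
    return chi2, dof

# Hypothetical sample: 500 values drawn from a normal distribution.
random.seed(1)
data = [random.gauss(10.0, 2.0) for _ in range(500)]
z_edges = [-math.inf, -1.5, -0.5, 0.5, 1.5, math.inf]  # 5 bins in z units

chi2, dof = chi_square_normality(data, z_edges)
reject = chi2 > CRITICAL_5_PERCENT[dof]
print(dof, round(chi2, 3), reject)
```

With five bins and two estimated parameters, only two degrees of freedom remain, so the critical value is noticeably smaller than it would be if the parameters were known in advance.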

Last Update: 2005-Jul-16