Verify distribution uniformity/Chi-squared test: Difference between revisions

From Rosetta Code
Content added Content deleted
m (Fixed lang tags.)
Line 6: Line 6:


This is an Explicit solution:
This is an Explicit solution:
<lang j>
<lang j>require 'stats/base'
require 'stats/base'


NB.*isUniform v Tests whether y is uniformly distributed
NB.*isUniform v Tests whether y is uniformly distributed
Line 22: Line 21:
degfreedom=. <: x NB. degrees of freedom
degfreedom=. <: x NB. degrees of freedom
signif > degfreedom chisqcdf :: 1: X2
signif > degfreedom chisqcdf :: 1: X2
)</lang>
)
</lang>
This is the equivalent Tacit solution:
This is the equivalent Tacit solution:
<lang j>
<lang j>require 'stats/base'
require 'stats/base'


SIGNIF=: 0.95 NB. set significance level
SIGNIF=: 0.95 NB. set significance level
Line 39: Line 36:
NB. y is: distribution to test
NB. y is: distribution to test
NB. x is: optionally specify number of categories possible
NB. x is: optionally specify number of categories possible
isUniformT=: (countCats $: ]) : (SIGNIF > calcDf chisqcdf :: 1: calcX2)
isUniformT=: (countCats $: ]) : (SIGNIF > calcDf chisqcdf :: 1: calcX2)</lang>
</lang>


Verbs and Distributions for testing:
Verbs and Distributions for testing:
<lang j>
<lang j>freqCount=: ~. ,. #/.~
freqCount=: ~. ,. #/.~


testFair=: monad define
testFair=: monad define
Line 51: Line 46:


FairDistrib=: 1e6 ?@$ 5
FairDistrib=: 1e6 ?@$ 5
UnfairDistrib=: (9.5e5 ?@$ 5) , (5e4 ?@$ 4)
UnfairDistrib=: (9.5e5 ?@$ 5) , (5e4 ?@$ 4)</lang>
</lang>


Usage:
Usage:
<lang j>
<lang j> testFair FairDistrib
testFair FairDistrib
distribution 4 143155 0 143085 1 143111 2 142706 6 142666 5 142365 3 142912 assessed as fair
distribution 4 143155 0 143085 1 143111 2 142706 6 142666 5 142365 3 142912 assessed as fair
testFair UnfairDistrib
testFair UnfairDistrib
Line 64: Line 57:
1
1
4 isUniformT 4 4 4 5 5 5 5 5 5 5 NB. not uniform if 4 categories possible
4 isUniformT 4 4 4 5 5 5 5 5 5 5 NB. not uniform if 4 categories possible
0</lang>
0

</lang>


=={{header|OCaml}}==
=={{header|OCaml}}==
Line 104: Line 95:
[| 199809; 200665; 199607; 200270; 199649 |];
[| 199809; 200665; 199607; 200270; 199649 |];
[| 522573; 244456; 139979; 71531; 21461 |]
[| 522573; 244456; 139979; 71531; 21461 |]
]
]</lang>
</lang>


Output
Output

Revision as of 17:12, 22 November 2009

Task
Verify distribution uniformity/Chi-squared test
You are encouraged to solve this task according to the task description, using any language you may know.

In this task, write a function to verify that a given distribution of values is uniform by using the test to see if the distribution has a likelihood of happening of at least the significance level (conventionally 5%). The function should return a boolean that is true if the distribution is one that a uniform distribution (with appropriate number of degrees of freedom) may be expected to produce.

J

J has two types of verb definitions, Explicit and Tacit.

This is an Explicit solution: <lang j>require 'stats/base'

NB.*isUniform v Tests whether y is uniformly distributed NB. result is: boolean describing if distribution y is uniform NB. y is: distribution to test NB. x is: optionally specify number of categories possible isUniform=: verb define

 (#~. y) isUniform y
 signif=. 0.95                    NB. set significance level
 expected=. (#y) % x              NB. number of items divided by the category count
 observed=. #/.~ y                NB. frequency count for each category
 X2=. +/ (*: observed - expected) % expected  NB. the test statistic
 degfreedom=. <: x                NB. degrees of freedom
 signif > degfreedom chisqcdf :: 1: X2

)</lang> This is the equivalent Tacit solution: <lang j>require 'stats/base'

SIGNIF=: 0.95 NB. set significance level countCats=: #@~. NB. counts the number of unique items getExpected=: #@] % [ NB. divides no of items by category count getObserved=: #/.~@] NB. counts frequency for each category calcX2=: [: +/ *:@(getObserved - getExpected) % getExpected NB. calculates test statistic calcDf=: <:@[ NB. calculates degrees of freedom for uniform distribution

NB.*isUniformT v Tests whether y is uniformly distributed NB. result is: boolean describing if distribution y is uniform NB. y is: distribution to test NB. x is: optionally specify number of categories possible isUniformT=: (countCats $: ]) : (SIGNIF > calcDf chisqcdf :: 1: calcX2)</lang>

Verbs and Distributions for testing: <lang j>freqCount=: ~. ,. #/.~

testFair=: monad define

 'distribution ', (":,freqCount y) , ' assessed as ', (isUniform y) {:: ;:'unfair fair'

)

FairDistrib=: 1e6 ?@$ 5 UnfairDistrib=: (9.5e5 ?@$ 5) , (5e4 ?@$ 4)</lang>

Usage: <lang j> testFair FairDistrib distribution 4 143155 0 143085 1 143111 2 142706 6 142666 5 142365 3 142912 assessed as fair

  testFair UnfairDistrib

distribution 0 203086 1 201897 2 202648 3 202388 4 189981 assessed as unfair

  isUniformT 4 4 4 5 5 5 5 5 5 5     NB. uniform if only 2 categories possible

1

  4 isUniformT 4 4 4 5 5 5 5 5 5 5   NB. not uniform if 4 categories possible

0</lang>

OCaml

This code needs to be compiled with library gsl.cma.

<lang ocaml>let sqr x = x *. x

let chi2UniformDistance distrib =

 let count, len = Array.fold_left (fun (s, c) e -> s + e, succ c)
 				   (0, 0) distrib in
 let expected = float count /. float len in
 let distance = Array.fold_left (fun s e ->
   s +. sqr (float e -. expected) /. expected
 ) 0. distrib in
 let dof = float (pred len) in
 dof, distance

let chi2Proba dof distance =

 Gsl_sf.gamma_inc_Q (0.5 *. dof) (0.5 *. distance)

let chi2IsUniform distrib significance =

 let dof, distance = chi2UniformDistance distrib in
 let likelihoodOfRandom = chi2Proba dof distance in
 likelihoodOfRandom > significance

let _ =

 List.iter (fun distrib ->
   let dof, distance = chi2UniformDistance distrib in
   Printf.printf "distribution ";
   Array.iter (Printf.printf "\t%d") distrib;
   Printf.printf "\tdistance %g" distance;
   Printf.printf "\t[%g > 0.05]" (chi2Proba dof distance);
   if chi2IsUniform distrib 0.05 then Printf.printf " fair\n"
   else Printf.printf " unfair\n"
 ) 
 [
   [| 199809; 200665; 199607; 200270; 199649 |];
   [| 522573; 244456; 139979; 71531; 21461 |]
 ]</lang>

Output

distribution    199809  200665  199607  200270  199649  distance 4.14628        [0.386571 > 0.05] fair
distribution    522573  244456  139979  71531   21461   distance 790063 [0 > 0.05] unfair

Tcl

Works with: Tcl version 8.5
Library: tcllib

<lang tcl>package require Tcl 8.5 package require math::statistics

proc isUniform {distribution {significance 0.05}} {

   set count [tcl::mathop::+ {*}[dict values $distribution]]
   set expected [expr {double($count) / [dict size $distribution]}]
   set X2 0.0
   foreach value [dict values $distribution] {

set X2 [expr {$X2 + ($value - $expected)**2 / $expected}]

   }
   set degreesOfFreedom [expr {[dict size $distribution] - 1}]
   set likelihoodOfRandom [::math::statistics::incompleteGamma \

[expr {$degreesOfFreedom / 2.0}] [expr {$X2 / 2.0}]]

   expr {$likelihoodOfRandom > $significance}

}</lang> Testing: <lang tcl>proc makeDistribution {operation {count 1000000}} {

   for {set i 0} {$i<$count} {incr i} {incr distribution([uplevel 1 $operation])}
   return [array get distribution]

}

set distFair [makeDistribution {expr int(rand()*5)}] puts "distribution \"$distFair\" assessed as [expr [isUniform $distFair]?{fair}:{unfair}]" set distUnfair [makeDistribution {expr int(rand()*rand()*5)}] puts "distribution \"$distUnfair\" assessed as [expr [isUniform $distUnfair]?{fair}:{unfair}]"</lang> Output:

distribution "0 199809 4 199649 1 200665 2 199607 3 200270" assessed as fair
distribution "4 21461 0 522573 1 244456 2 139979 3 71531" assessed as unfair