Statistics/Normal distribution: Difference between revisions

Content added Content deleted

Inline

Revision as of 12:02, 11 April 2013

The Normal (or Gaussian) distribution is a freqently used distribution in statistics. While most programming languages provide a uniformly distributed random number generator, one can derive normally distributed random numbers from a uniform generator.

The task

Take a uniform random number generator and create a large (you decide how large) set of numbers that follow a normal (Gaussian) distribution. Calculate the dataset's mean and stddev, and show the histogram here.
Mention any native language support for the generation of normally distributed random numbers.

Reference

You may refer to code in Statistics/Basic if available.

C++

showing features of C++11 here <lang C++>#include <random>

include <map>
include <string>
include <iostream>
include <cmath>
include <iomanip>

int main( ) {

  std::random_device myseed ;
  std::mt19937 engine ( myseed( ) ) ;
  std::normal_distribution<> normDistri ( 2 , 3 ) ;
  std::map<int , int> normalFreq ;
  int sum = 0 ; //holds the sum of the randomly created numbers
  double mean = 0.0 ;
  double stddev = 0.0 ;
  for ( int i = 1 ; i < 10001 ; i++ ) 
     ++normalFreq[ normDistri ( engine ) ] ;
  for ( auto MapIt : normalFreq ) {
     sum += MapIt.first * MapIt.second ;
  }
  mean = sum / 10000 ;
  stddev = sqrt( sum / 10000 ) ;
  std::cout << "The mean of the distribution is " << mean << " , the " ;
  std::cout << "standard deviation " << stddev << " !\n" ;
  std::cout << "And now the histogram:\n" ;
  for ( auto MapIt : normalFreq ) {
     std::cout << std::left << std::setw( 4 ) << MapIt.first <<

std::string( MapIt.second / 100 , '*' ) << std::endl ;

  }
  return 0 ;

}</lang> Output:

The mean of the distribution is 1 , the standard deviation 1 !
And now the histogram:
-10 
-9  
-8  
-7  
-6  
-5  
-4  *
-3  **
-2  ****
-1  ******
0   *********************
1   ************
2   ************
3   ***********
4   *********
5   ******
6   ****
7   **
8   *
9   
10  
11  
12  
13

D

This uses the Box-Muller method as in the Go entry, and the module from the Statistics/Basic. A ziggurat-based normal generator for the Phobos standard library is in the works. <lang d>import std.stdio, std.random, std.math, std.range, std.algorithm,

      statistics_basic;

struct Normals {

   double mu, sigma;
   double[2] state;
   size_t index = state.length;
   enum empty = false;

   void popFront() pure nothrow { index++; }

   @property double front() {
       if (index >= state.length) {
           immutable r = sqrt(-2 * uniform!"]["(0., 1.).log) * sigma;
           immutable x = 2 * PI * uniform(0.0, 1.0);
           state = [mu + r * x.sin, mu + r * x.cos];
           index = 0;
       }
       return state[index];
   }

}

void main() {

   const data = Normals(0.0, 0.5).take(100_000).array;
   writefln("Mean: %8.6f, SD: %8.6f\n", data.meanStdDev[]);
   data.map!q{ max(0.0, min(0.9999, a / 3 + 0.5)) }.showHistogram01;

}</lang>

Output:

Mean: 0.000528, SD: 0.502245

 0.0: *
 0.1: ******
 0.2: *****************
 0.3: ***********************************
 0.4: *************************************************
 0.5: **************************************************
 0.6: **********************************
 0.7: *****************
 0.8: ******
 0.9: *

Go

Box-Muller method shown here. Go has a normally distributed random function in the standard library, as shown in the Go Random numbers solution. It uses the ziggurat method. <lang go>package main

import (

   "fmt"
   "math"
   "math/rand"
   "strings"

)

// Box-Muller func norm2() (s, c float64) {

   r := math.Sqrt(-2 * math.Log(rand.Float64()))
   s, c = math.Sincos(2 * math.Pi * rand.Float64())
   return s * r, c * r

}

func main() {

   const (
       n     = 10000
       bins  = 12
       sig   = 3
       scale = 100
   )
   var sum, sumSq float64
   h := make([]int, bins)
   for i, accum := 0, func(v float64) {
       sum += v
       sumSq += v * v
       b := int((v + sig) * bins / sig / 2)
       if b >= 0 && b < bins {
           h[b]++
       }
   }; i < n/2; i++ {
       v1, v2 := norm2()
       accum(v1)
       accum(v2)
   }
   m := sum / n
   fmt.Println("mean:", m)
   fmt.Println("stddev:", math.Sqrt(sumSq/float64(n)-m*m))
   for _, p := range h {
       fmt.Println(strings.Repeat("*", p/scale))
   }

}</lang> Output:

mean: -0.0034970888831523488
stddev: 1.0040682925006286

*
****
*********
***************
*******************
******************
**************
*********
****
*

Mathematica

<lang Mathematica>x:= RandomReal[1] SampleNormal[n_] := (Print[#//Length, " numbers, Mean : ", #//Mean, ", StandardDeviation : ", #//StandardDeviation];

   Histogram[#, BarOrigin -> Left,Axes -> False])& [(Table[(-2*Log[x])^0.5*Cos[2*Pi*x], {n} ]]

Invocation: SampleNormal[ 10000 ] ->10000 numbers, Mean : -0.0122308, StandardDeviation : 1.00646 </lang>

MATLAB / Octave

<lang Matlab> N = 100000;

 x = randn(N,1);
 mean(x)
 std(x) 
 [nn,xx] = hist(x,100);
 bar(xx,nn);</lang>

Liberty BASIC

Uses LB Statistics/Basic <lang lb>call sample 100000

end

sub sample n

   dim dat( n)
   for i =1 to n
       dat( i) =normalDist( 1, 0.2)
   next i

   '// show mean, standard deviation. Find max, min.
   mx  =-1000
   mn  = 1000
   sum =0
   sSq =0
   for i =1 to n
       d =dat( i)
       mx =max( mx, d)
       mn =min( mn, d)
       sum =sum +d
       sSq =sSq +d^2
   next i
   print n; " data terms used."

   mean =sum / n
   print "Largest term was "; mx; " & smallest was "; mn
   range =mx -mn
   print "Mean ="; mean

   print "Stddev ="; ( sSq /n -mean^2)^0.5

   '// show histogram
   nBins =50
   dim bins( nBins)
   for i =1 to n
       z =int( ( dat( i) -mn) /range *nBins)
       bins( z) =bins( z) +1
   next i
   for b =0 to nBins -1
       for j =1 to int( nBins *bins( b)) /n *30)
           print "#";
       next j
       print
   next b
   print

end sub

function normalDist( m, s) ' Box Muller method

   u =rnd( 1)
   v =rnd( 1)
   normalDist =( -2 *log( u))^0.5 *cos( 2 *3.14159265 *v)

end function</lang>

100000 data terms used.
Largest term was 4.12950792 & smallest was -4.37934139
Mean =-0.26785425e-2
Stddev =1.00097669

#
##
###
#####
########
############
################
########################
##############################
######################################
##############################################
########################################################
###################################################################
##############################################################################
#######################################################################################
################################################################################################
####################################################################################################
########################################################################################################
#####################################################################################################
##############################################################################################
#########################################################################################
##################################################################################
#########################################################################
##############################################################
####################################################
##########################################
#################################
##########################
##################
#############
#########
######
####
##
#
#

PARI/GP

Works with: PARI/GP version 2.4.3 and above

<lang parigp>rnormal()={ my(u1=random(1.),u2=random(1.); sqrt(-2*log(u1))*cos(2*Pi*u1) \\ Could easily be extended with a second normal at very little cost. }; mean(v)={

 sum(i=1,#v,v[i])/#v

}; stdev(v,mu="")={

 if(mu=="",mu=mean(v));
 sqrt(sum(i=1,#v,(v[i]-mu)^2))/#v

}; histogram(v,bins=16,low=0,high=1)={

 my(u=vector(bins),width=(high-low)/bins);
 for(i=1,#v,u[(v[i]-low)\width+1]++);
 u

}; show(n)={

 my(v=vector(n,i,rnormal()),m=mean(v),s=stdev(v,m),h,sz=ceil(n/300));
 h=histogram(v,,vecmin(v)-.1,vecmax(v)+.1);
 for(i=1,#h,for(j=1,h[i]\sz,print1("#"));print());

}; show(10^4)</lang>

For versions before 2.4.3, define <lang parigp>rreal()={

 my(pr=32*ceil(default(realprecision)*log(10)/log(4294967296))); \\ Current precision
 random(2^pr)*1.>>pr

};</lang> and use rreal() in place of random(1.).

A PARI implementation: <lang C>GEN rnormal(long prec) { pari_sp ltop = avma; GEN u1, u2, left, right, ret; u1 = randomr(prec); u2 = randomr(prec); left = sqrtr_abs(shiftr(mplog(u1), 1)); right = mpcos(mulrr(shiftr(mppi(prec), 1), u2));

ret = mulrr(left, right); ret = gerepileupto(ltop, ret); return ret; }</lang> Use mpsincos and caching to generate two values at nearly the same cost.

Perl 6

<lang perl6>constant τ = 2 * pi;

sub normdist ($m, $σ) {

   gather loop {
       my $r = sqrt -2 * log rand;
       my $Θ = τ * rand;
       take $r * $_($Θ) * $σ + $m for &cos, &sin;
   }

}

sub MAIN ($size = 100000, $mean = 50, $stddev = 4) {

   my @dataset = normdist($mean,$stddev)[^$size];

   my $m = [+](@dataset) / $size;
   say (:$m);

   my $σ = sqrt [+](@dataset X** 2) / $size - $mean**2;
   say (:$σ);

   (my %hash){.round}++ for @dataset;
   my $scale = 180 * $stddev / $size;
   constant @subbar = < ⎸ ▏ ▎ ▍ ▌ ▋ ▊ ▉ █ >;
   for %hash.keys».Int.minmax(+*) -> $i {
       my $x = (%hash{$i} // 0) * $scale;
       my $full = floor $x;
       my $part = 8 * ($x - $full);
       say $i, "\t", '█' x $full, @subbar[$part];
   }

}</lang>

Output:

"m" => 50.006107405837142e0
"σ" => 4.0814435639885254e0
33	⎸
34	⎸
35	⎸
36	▏
37	▎
38	▊
39	█▋
40	███⎸
41	█████▊
42	██████████⎸
43	███████████████▋
44	███████████████████████▏
45	████████████████████████████████▌
46	███████████████████████████████████████████▍
47	██████████████████████████████████████████████████████▏
48	███████████████████████████████████████████████████████████████▏
49	█████████████████████████████████████████████████████████████████████▋
50	███████████████████████████████████████████████████████████████████████▊
51	█████████████████████████████████████████████████████████████████████▌
52	███████████████████████████████████████████████████████████████⎸
53	██████████████████████████████████████████████████████▎
54	███████████████████████████████████████████⎸
55	████████████████████████████████▌
56	███████████████████████▍
57	███████████████▉
58	█████████▉
59	█████▍
60	███▍
61	█▋
62	▊
63	▍
64	▏
65	⎸
66	⎸
67	⎸

PureBasic

<lang purebasic>Procedure.f randomf(resolution = 2147483647)

 ProcedureReturn Random(resolution) / resolution

EndProcedure

Procedure.f normalDist() ;Box Muller method

  ProcedureReturn Sqr(-2 * Log(randomf())) * Cos(2 * #PI * randomf())

EndProcedure

Procedure sample(n, nBins = 50)

 Protected i, maxBinValue, binNumber
 Protected.f d, mean, sum, sumSq, mx, mn, range
 
 Dim dat.f(n)
 For i = 1 To n
   dat(i) = normalDist()
 Next
 
 ;show mean, standard deviation, find max & min.
 mx  = -1000
 mn  =  1000
 sum = 0
 sumSq = 0
 For i = 1 To n
   d = dat(i)
   If d > mx: mx = d: EndIf
   If d < mn: mn = d: EndIf
   sum + d
   sumSq + d * d
 Next
 
 PrintN(Str(n) + " data terms used.")
 PrintN("Largest term was " + StrF(mx) + " & smallest was " + StrF(mn))
 mean = sum / n
 PrintN("Mean = " + StrF(mean))
 PrintN("Stddev = " + StrF((sumSq / n) - Sqr(mean * mean)))
 
 ;show histogram
 range = mx - mn
 Dim bins(nBins)
 For i = 1 To n
   binNumber = Int(nBins * (dat(i) - mn) / range)
   bins(binNumber) + 1
 Next
  
 maxBinValue = 1
 For i = 0 To nBins
   If bins(i) > maxBinValue
     maxBinValue = bins(i)
   EndIf
 Next
 
 #normalizedMaxValue = 70
 For binNumber = 0 To nBins
   tickMarks = Round(bins(binNumber) * #normalizedMaxValue / maxBinValue, #PB_Round_Nearest)
   PrintN(ReplaceString(Space(tickMarks), " ", "#"))
 Next
 PrintN("")

EndProcedure

If OpenConsole()

 sample(100000)
 
 Print(#CRLF$ + #CRLF$ + "Press ENTER to exit"): Input()
 CloseConsole()

EndIf</lang> Sample output:

100000 data terms used.
Largest term was 4.5352029800 & smallest was -4.5405135155
Mean = 0.0012346541
Stddev = 0.9959455132





#
###
######
##########
##################
############################
#########################################
#####################################################
################################################################
######################################################################
######################################################################
################################################################
#####################################################
#########################################
#############################
##################
##########
######
###
#

Python

This uses the external matplotlib package as well as the built-in standardlib function random.gauss. <lang python>from __future__ import division import matplotlib.pyplot as plt import random

mean, stddev, size = 50, 4, 100000 data = [random.gauss(mean, stddev) for c in range(size)]

mn = sum(data) / size sd = (sum(x*x for x in data) / size

     - (sum(data) / size) ** 2) ** 0.5

print("Sample mean = %g; Stddev = %g; max = %g; min = %g for %i values"

     % (mn, sd, max(data), min(data), size))

plt.hist(data,bins=50)</lang>

Output:

Sample mean = 49.9822; Stddev = 4.00938; max = 66.8091; min = 33.5283 for 100000 values

REXX

The REXX language doesn't have any "higher math" functions like SIN/COS/LN/LOG/SQRT/POW/etc,
so we hoi polloi programmers have roll our own. <lang rexx>/*REXX pgm gens 10,000 normally distributed #s (Gaussian distribution).*/ parse arg n seed . /*allow specification of N | seed*/ if n== | n==',' then n=10000 /* N is the size of the array. */ if seed\== then call random ,,seed /*use seed for repeatable RANDOM#*/ call pi /*call subroutine to define pi. */

         do g=1  for n                /*generate N uniform random nums.*/
         #.g=sqrt(-2*ln(rand())) * cos(2*pi*rand())
         end   /*g*/

mn=#.1; mx=mn; s=0; ss=0; noise=n*.0005 /*calculate noise: .05%∙N.*/

         do j=1  for n;   _=#.j;   s=s+_;    ss=ss+_*_   /*sum & sum ²s*/
         mn=min(mn,#.j);  mx=max(mx,#.j)                 /*find min/max*/
         end   /*j*/

!.=0 say 'number of data points = ' n say ' minimum =' mn say ' maximum = ' mx say ' arithmetic mean = ' s/n say ' standard deviation = ' sqrt(ss/n - (s/n)**2) r=mx-mn /*used for scaling the histogram.*/ sd=50; sde=sd-4 /*screen depth, effective depth. */ sw=80; swe=sw-1 /*screen width, effective width. */ ?mn=; ?mx=

         do i=1  for n
         ?=trunc( (#.i-mn)/r * sde)
         !.?=!.?+1
         if ?mn== then do; ?mn=!.j;  ?mx=!.j; end
         ?mn=min(?mn,!.?)
         ?mx=max(?mx,!.?)
         end   /*h*/

f=swe/?mx

         do h=0  for sde;                     _=!.h
         if _>noise then say copies('─',trunc(_*f))   /*show histogram.*/
         end   /*h*/

exit /*stick a fork in it, we're done.*/ /*──────────────────────────────────subroutines─────────────────────────*/ e: e=2.7182818284590452353602874713526624977572470936999595749669676277240766303535; return e pi: pi=3.1415926535897932384626433832795028841971693993751058209749445923078164062862; return pi rand: return random(0,1e5)/1e5 /*REXX gens uniform rand integers*/ r2r: return arg(1)//(2*pi()) sqrt: procedure;parse arg x; if x=0 then return 0; d=digits(); numeric digits 11; g=.sqrtGuess()

       do j=0 while p>9; m.j=p; p=p%2+1; end; do k=j+5 to 0 by -1; if m.k>11 then numeric digits m.k
       g=.5*(g+x/g); end; numeric digits d; return g/1

.sqrtGuess: numeric form; m.=11; p=d+d%4+2

       parse value format(x,2,1,,0) 'E0' with g 'E' _ .; return g*.5'E'_%2

cos: procedure; arg x; x=r2r(x); a=abs(x); numeric fuzz min(9,digits()-9); if a=pi() then return -1

       if a=pi()/2|a=2*pi() then return 0;if a=pi()/3 then return .5;if a=2*pi()/3 then return -.5;return .sincos(1,1,-1)

.sincos:parse arg z,_,i; x=x*x; p=z; do k=2 by 2; _=-_*x/(k*(k+i)); z=z+_; if z=p then leave; p=z; end; return z ln: procedure; parse arg x,f; call e; ig=x>1.5; is=1-2*(ig\==1); ii=0; xx=x; return .ln_comp() .ln_comp: do while ig&xx>1.5|\ig&xx<.5;_=e;do k=-1;iz=xx*_**-is;if k>=0&(ig&iz<1|\ig&iz>.5) then leave;_=_*_;izz=iz;end

       xx=izz;ii=ii+is*2**k;end;x=x*e**-ii-1;z=0;_=-1;p=z;do k=1;_=-_*x;z=z+_/k;if z=p then leave;p=z;end;return z+ii

</lang> output when using the default input

number of data points =  10000
              minimum = -4.16421320
              maximum =  3.90674449
      arithmetic mean =  -0.00429983726
   standard deviation =  0.998221879

─
───
──
────
───────
────────────
──────────────
─────────────────────
────────────────────────────
────────────────────────────────
────────────────────────────────────────────
───────────────────────────────────────────────────
─────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────────────
─────────────────────────────────────────────────────────────────────────
─────────────────────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────────────────────
──────────────────────────────────────────────────────────────
───────────────────────────────────────────────────────
─────────────────────────────────────────────
────────────────────────────────
───────────────────────────────
────────────────────────
───────────────────
─────────────
────────
─────
────
──
──
─

SAS

<lang sas>data test; n=100000; twopi=2*constant('pi'); do i=1 to n; u=ranuni(0); v=ranuni(0); r=sqrt(-2*log(u)); x=r*cos(twopi*v); y=r*sin(twopi*v); z=rannor(0); output; end; keep x y z;

proc means mean stddev;

proc univariate; histogram /normal;

run;

/* Variable Mean Std Dev

x -0.0052720 0.9988467 y 0.000023995 1.0019996 z 0.0012857 1.0056536

/</lang>

Tcl

<lang tcl>package require Tcl 8.5

Uses the Box-Muller transform to compute a pair of normal random numbers

proc tcl::mathfunc::nrand {mean stddev} {

   variable savednormalrandom
   if {[info exists savednormalrandom]} {

return [expr {$savednormalrandom*$stddev + $mean}][unset savednormalrandom]

   }
   set r [expr {sqrt(-2*log(rand()))}]
   set theta [expr {2*3.1415927*rand()}]
   set savednormalrandom [expr {$r*sin($theta)}]
   expr {$r*cos($theta)*$stddev + $mean}

} proc stats {size {slotfactor 10}} {

   set sum 0.0
   set sum2 0.0
   for {set i 0} {$i < $size} {incr i} {

set r [expr { nrand(0.5, 0.2) }]

incr histo([expr {int(floor($r*$slotfactor))}]) set sum [expr {$sum + $r}] set sum2 [expr {$sum2 + $r**2}]

   }
   set mean [expr {$sum / $size}]
   set stddev [expr {sqrt($sum2/$size - $mean**2)}]
   puts "$size numbers"
   puts "Mean:   $mean"
   puts "StdDev: $stddev"
   foreach i [lsort -integer [array names histo]] {

puts [string repeat "*" [expr {$histo($i)*350/int($size)}]]

}

stats 100 puts "" stats 1000 puts "" stats 10000 puts "" stats 100000 20</lang> Sample output:

100 numbers
Mean:   0.49355955990390254
StdDev: 0.19651396178121985
***
*******
**************
***********************************
********************************************************
******************************************************************
*************************************************************************
******************************************
**************************************
**************

1000 numbers
Mean:   0.5066940614105869
StdDev: 0.2016794788065389


*
*****
**************
****************************
**********************************************************
****************************************************************
*************************************************************
******************************************************
***********************************
************
*********
*

10000 numbers
Mean:   0.49980964730768285
StdDev: 0.1968441612522318

*
*****
***************
*******************************
*****************************************************
******************************************************************
*******************************************************************
****************************************************
*********************************
***************
*****
*



100000 numbers
Mean:   0.49960438950922254
StdDev: 0.20060211160998606





*
**
***
******
*********
**************
******************
***********************
*****************************
********************************
**********************************
**********************************
********************************
****************************
***********************
******************
*************
*********
******
***
**
*

The blank lines in the output are where the number of samples is too small to even merit a single unit on the histogram.