Cumulative standard deviation: Difference between revisions

Cumulative standard deviation (view source)

Revision as of 10:49, 25 June 2022

1,480 bytes removed , 1 year ago

→‎{{header|R}}: 5 solutions, all wrong or hilariously slow: repeating the sum for each subvector leads to a O(n^2) algo

Eoraptor

1,336

edits

Revision as of 19:39, 7 February 2022 (view source) rosettacode>Wasamasa (Emacs Lisp: Improve solutions) ← Older edit		Revision as of 10:49, 25 June 2022 (view source) Eoraptor (talk \| contribs) (→‎{{header\|R}}: 5 solutions, all wrong or hilariously slow: repeating the sum for each subvector leads to a O(n^2) algo) Newer edit →
Line 3,569: =={{header\|R}}== ~~===Built-in Std Dev fn===~~ ~~<lang rsplus>#The built-in standard deviation function applies the Bessel correction. To reverse this, we can apply an uncorrection.~~ ~~#If na.rm is true, missing data points (NA values) are removed.~~ ~~reverseBesselCorrection <- function(x, na.rm=FALSE)~~ { ~~if(na.rm) x <- x[!is.na(x)]~~ len <- length(x)▼ ~~if(len < 2) stop("2 or more data points required")~~ sqrt((len-1)/len)▼ } ~~testdata <- c(2,4,4,4,5,5,7,9)~~ ~~reverseBesselCorrection(testdata)sd(testdata) #2</lang>~~ ~~===From scratch===~~ ~~<lang rsplus>#Again, if na.rm is true, missing data points (NA values) are removed.~~ ~~uncorrectedsd <- function(x, na.rm=FALSE)~~ { ~~len <- length(x)~~ ~~if(len < 2) stop("2 or more data points required")~~ ~~mu <- mean(x, na.rm=na.rm)~~ ~~ssq <- sum((x - mu)^2, na.rm=na.rm)~~ ~~usd <- sqrt(ssq/len)~~ ~~usd~~ } ~~uncorrectedsd(testdata) #2</lang>~~ ~~==="Running" SD===~~ If we desire a solution that gives every "running" standard deviation for each input, rather than only giving one number as our final output, we can do the following. To make this differ from previous solutions, we will not have our code make any mention of missing values, and we will show off R's Reduce and sapply. ~~<lang rsplus>#Once again, we have to make a standard deviation function from scratch.~~ ~~biasedSd <- function(data) sqrt(mean((data - mean(data))^2))~~ ~~cumSd <- function(data) sapply(Reduce(c, data, accumulate = TRUE), biasedSd)</lang>~~ ~~{{out}}~~ ~~<pre>> cumSd(c(2, 4, 4, 4, 5, 5, 7, 9))~~ ~~[1] 0.0000000 1.0000000 0.9428090 0.8660254 0.9797959 1.0000000 1.3997084 2.0000000</pre>~~ To compute the running sum, one must keep track of the number of items, the sum of values, and the sum of squares. ~~===Stateful SD===~~ If we want a function that remembers and uses the previous inputs, letting us be very strict about the "one at a time" requirement, then we can lift biasedSd from the previous solution and make good use of the distinction between R's <- and <<- methods of assignment. If the goal is to get a vector of running standard deviations, the simplest is to do it with cumsum: <lang rsplus>cumSDStateful <- function()▼ { <lang ~~data~~r>cumsd <- ~~numeric~~function(0x) { ▲ ~~len~~n <- ~~length~~seq_along(x) function(oneNumber)▼ sqrt(cumsum(x^2) / n - (cumsum(x) / n)^2) {▼ }</lang> ~~data <<- c(data, oneNumber)~~ ~~biasedSd(data)~~ set.seed(12345L) x <- rnorm(10) cumsd(x) # [1] 0.0000000 0.3380816 0.8752973 1.1783628 1.2345538 1.3757142 1.2867220 1.2229056 1.1665168 1.1096814 # Compare to the naive implementation, i.e. compute sd on each sublist: Vectorize(function(k) sd(x[1:k]) sqrt((k - 1) / k))(seq_along(x)) # [1] NA 0.3380816 0.8752973 1.1783628 1.2345538 1.3757142 1.2867220 1.2229056 1.1665168 1.1096814 # Note that the first is NA because sd is unbiased formula, hence there is a division by n-1, which is 0 for n=1.</lang> The task requires an accumulator solution: ▲<lang ~~rsplus~~r>~~cumSDStateful~~accumsd <- function() { n <- 0 m <- 0 s <- 0 ▲ { ▲ function(~~oneNumber~~x) { n <<- n + 1 m <<- m + x s <<- s + x * x ▲ sqrt(~~(len~~s / n -1) (m /~~len~~ n)^2) } } ~~state <- cumSDStateful()</lang>~~ ~~{{out}}~~ ~~<pre>> state(2);state(4);state(4);state(4);state(5);state(5);state(7);state(9)~~ ~~[1] 0~~ ~~[1] 1~~ ~~[1] 0.942809~~ ~~[1] 0.8660254~~ ~~[1] 0.9797959~~ ~~[1] 1~~ ~~[1] 1.399708~~ ~~[1] 2</pre>~~ f <- accumsd() ~~===Environment solution===~~ sapply(x, f) ~~R gives us some control over what environment expressions are evaluated in. This lets us shorten the previous solution and get identical output.~~ # [1] 0.0000000 0.3380816 0.8752973 1.1783628 1.2345538 1.3757142 1.2867220 1.2229056 1.1665168 1.1096814</lang> ~~<lang rsplus>localCumSD <- local({~~ ~~data <- numeric(0)~~ ~~function(oneNumber)~~ { ~~data <<- c(data, oneNumber)~~ ~~biasedSd(data)#Again, lifted from the ""Running" SD" solution.~~ } }) ~~localCumSD(2);localCumSD(4);localCumSD(4);localCumSD(4);localCumSD(5);localCumSD(5);localCumSD(7);localCumSD(9)</lang>~~ =={{header\|Racket}}==