Talk:Welch's t-test: Difference between revisions

Revision as of 18:23, 8 July 2015

Needs better task description

I haven't looked at the C code yet, but I'm assuming it's using t-test? The description should provide more context and explanations of concepts, and preferably links to algorithms. --Ledrug (talk) 20:16, 26 May 2015 (UTC)

Yes, this uses Welch's 2-sided t-test, as I commented inside the code.

Hi, you need to take all those nuggets out of your code comments and put them into an improved task description. The task description needs to stand on its own as a clear and concise description of what needs to be accomplished.

(P.S. Please sign your comments). --Paddy3118 (talk) 19:09, 27 May 2015 (UTC)

Hi Hailholyghost, I just had a look at the link you give and it is inadequate as a description for an RC task. The task description needs to be written for an audience of enthusiastic programmers - not necessarily maths or stats or whatever enthusiasts. It seems that you are new to RC and maybe you need to take time and lurk a bit more to understand a little more about how things are done.

This task needs a full description of the calculation method to use, probably in pseudocode, together with a description of what the algorithm should be used for to complete a good task. The code you give is not enough for a task description. --Paddy3118 (talk) 19:23, 27 May 2015 (UTC)

I added that link as a pointer in the right direction for now. To be fair though, null hypothesis testing is very involved and sometimes borders on black magic, so it may be difficult to explain everything clearly in a short text. The following wiki links may be relevant: wp:Statistical hypothesis testing, wp:ANOVA, and more specifically wp:Student's t-test and wp:Welch's t-test. The Student's t-test article has more details on the actual computations, which form the basis for Welch's test. --Ledrug (talk) 19:44, 27 May 2015 (UTC)

So task description should perhaps also include explicit cautions about p values... Perhaps xkcd 882 and 1478? --Rdm (talk) 13:32, 3 June 2015 (UTC)

I've improved the C function to work with larger arrays using tgammal instead of tgamma, and have exception handling if the entered array is too small. I have made some modifications to the Simpson integration part, and the function now runs about twice as fast as before. I have also added a description. I have removed comments in my code. I hope this is satisfactory.--Hailholyghost (talk) 18:28, 3 June 2015 (UTC)

Looks like you pulled some of that math out of wikipedia, but even there there's not quite enough context. For example, what is the definition of u and of f(u)? That kind of stuff works in a classroom context where representative examples have been recently referenced, but that's not the case here.

Also, if you are going the math route I think you should mention basic assumptions (for example, I think you are assuming that the list of values were taken from what would be some normal distribution). --Rdm (talk) 20:04, 3 June 2015 (UTC)

I can work on the task description later. On a more practical matter, this code cannot calculate p-value for very large array sizes (> about 1755 elements). Does anyone know how to solve this? ==hailholyghost 15:18 Friday 5 June (UTC)

The fraction ${\frac {\Gamma (a)}{\Gamma (a+0.5)}}$ blows up. How can I get this ratio in terms of lgammal? ==hailholyghost 15:26 7 June 2015.

I can get this fraction in terms of $B(a,a+0.5)$ , but it is computationally expensive. At least it works now. As for the task description, how much detail is required? I only put what I considered necessary to the computation, as this is work I did myself. The internet is awash with articles about p-value, so I only linked to those wikipedia articles. The reason I wrote this page is because I was unable to find a way to implement this computation directly, after weeks and weeks of internet searches. I hope that this computer code can be beneficial to others.--hailholyghost 14:25 Tuesday 9 June 2015 UTC.

Just use exp(lgamma(a) - lgamma(a+0.5)). Replacing tgamma with tgammal is only delaying the overflow until longer data (10000 or so?), while loggamma function should not overflow with any reasonable data. --Ledrug (talk) 18:35, 9 June 2015 (UTC)

Hi Ledrug, I think you used the logarithm identity

$\log({\frac {a}{b}})=\log(a)-\log(b)$

but this doesn't apply here because

${\frac {\Gamma (a)}{\Gamma (a+0.5)}}\implies {\frac {\ln(\Gamma (a))}{\ln(\Gamma (a+0.5))}}\neq \ln(\Gamma (a))-\ln(\Gamma (a+0.5))$

rather,

${\frac {\ln(\Gamma (a))}{\ln(\Gamma (a+0.5))}}=\log _{\Gamma (a+0.5)}(\Gamma (a))$

which unfortunately doesn't seem to go anywhere.

The answer to this has to be buried somewhere in the bowels of the internet... but I can't find it...--hailholyghost 14:07 10 June 2015 (UTC)

How does that not apply?

$e^{\ln A-\ln B}=e^{\ln {A \over B}}={A \over B}$

where A and B are the gammas, isn't that what you want? --Ledrug (talk) 19:02, 10 June 2015 (UTC)

If I understand the mass of expressions on the task page, you want to evaluate

$\mathrm {B} (x,y)={\dfrac {\Gamma (x)\,\Gamma (y)}{\Gamma (x+y)}}$

But this is equivalent to

$\mathrm {B} (x,y)=\exp((\ln(\Gamma (x))+\ln(\Gamma (y)))-\ln(\Gamma (x+y)))$

Or have I misunderstood what the task needs? --Rdm (talk) 13:31, 10 June 2015 (UTC)

Rdm, thank you so much!!!! --hailholyghost 15:11 EST 10 June 2015 (UST)

--Ledrug you are correct, of course, I put what you said into the task description. I'll put more about the definition of the p-value and warnings, maybe split the task description into two different sections.--hailholyghost 16:00 UTC 13 June 2015 (UST)

Task description complete?

I have made the task description more complete. I consider this page as ready to be published as a complete task. If someone else feels it is not ready, please give me a *specific* description of what's missing or why this isn't yet ready. I tried adding references but had formatting issues. I would like to cite this link, among others, if someone could please show me how to do this: http://www.nature.com/polopoly_fs/1.14700!/menu/main/topColumns/topLeftColumn/pdf/506150a.pdf --Hailholyghost (talk) 13:28, 23 June 2015 (UTC)--

If you look at most of the other tasks and consider the authorship of the examples, then you might agree with me that your task description stands out as being couched in heavy mathematical notation rather than in, say, pseudo-code for example. This puts a barrier between the readership and the task as perfectly proficient programmers would also need to be statisticians to follow the description.

This is not what RC is about - as you can see from other examples where very mathematical concepts such as Quaternions are given in task form that is explained to a programming audience. That has not been done in your draft task.

In short; explain it to the RC audience rather than to yourself - If you don't have an idea of the RC audience, (and that might be the case as you are asking how to create links), then you need to both lurk more on the site and read other tasks until you do. --Paddy3118 (talk) 14:03, 23 June 2015 (UTC)

Probably would be a good idea to include a link to Gamma_function and also to explain how to handle the definite integrals. I think we also need some documentation on how to implement lngamma given a decent implementation of gamma (log of gamma of fractional part of n plus sum of logs of the positive integers less than the integer part of n, or something like that). We might need a bit more than that, but I think we need at least that much. --Rdm (talk) 18:17, 23 June 2015 (UTC)

One more thing we need here is how to calculate the degrees of freedom for a single dataset. You've only supplied an expression for approximating the degrees of freedom of the two sets combined if we already know the degrees of freedom for each of the sets. (Presumably - since you are asking for sample variance - it's N-1 - but that should be specified.) --Rdm (talk) 05:07, 3 July 2015 (UTC)

Another issue: according to wp:Welch's_t_test "$s_{1}^{2}$ is sample variance" and you seem to be using the equations from there, but in your task description you currently instead say that "$s_{n}$ is the sample variance of set $n$" (Also, sample variance can be calculated in one of two ways - the expression you gave corresponds to what they label "unbiased sample variance" .. perhaps a minor issue? But I implemented what I think your task description has declared I should be calculating and I get a different result than the other task implementations, so I'm having to review all the basics...) --Rdm (talk) 06:03, 3 July 2015 (UTC)

I also noticed the unexplained nu1 and nu2 and the s instead of s^2. In addition, in the sample variance formula I think the subscript on the mean should be lower case n. And the term p2-tail is unexplained. Does this mean the same as p-value? —Sonia (talk) 00:07, 7 July 2015 (UTC)
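For reference, the quantities under discussion can be sketched in C as below. This assumes the unbiased sample variance (dividing by N-1) and per-set degrees of freedom of N-1, as suggested in the comments above; the function names are hypothetical and not taken from the task page.

```c
#include <stddef.h>

/* Arithmetic mean of n values. */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / n;
}

/* Unbiased sample variance s^2 (divides by n - 1); requires n >= 2. */
double sample_variance(const double *x, size_t n)
{
    double m = mean(x, n), ss = 0.0;
    for (size_t i = 0; i < n; i++)
        ss += (x[i] - m) * (x[i] - m);
    return ss / (n - 1);
}

/* Welch-Satterthwaite approximation to the combined degrees of freedom,
 * taking s^2 (not s) for each set and nu_i = n_i - 1. */
double welch_df(double var1, size_t n1, double var2, size_t n2)
{
    double a = var1 / n1, b = var2 / n2;
    return (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1));
}
```

A quick sanity check on the formula: when both sets have the same size n and the same variance, `welch_df` reduces to 2(n-1).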

Hello Sonia and Rdm, I think I've answered your questions about sample variance and other errors, and thank you for catching the mistakes. Please let me know if you see any other errors.--Hailholyghost (talk) 06:08, 7 July 2015 (UTC)

Task definition vs. Task implementation

Currently the task description asks for

$p=1-{\frac {1}{2}}\times {\frac {\int _{0}^{\frac {\nu }{t^{2}+\nu }}{\frac {r^{{\frac {\nu }{2}}-1}}{\sqrt {1-r}}}\,\mathrm {d} r}{\exp(\ln(\Gamma ({\frac {\nu }{2}}))+\ln(\Gamma (0.5))-\ln(\Gamma ({\frac {\nu }{2}}+0.5)))}}$

But after reading the C implementation, what is actually being calculated is

$p={\frac {\int _{0}^{\frac {\nu }{t^{2}+\nu }}{\frac {r^{{\frac {\nu }{2}}-1}}{\sqrt {1-r}}}\,\mathrm {d} r}{\exp(\ln(\Gamma ({\frac {\nu }{2}}))+\ln(\Gamma (0.5))-\ln(\Gamma ({\frac {\nu }{2}}+0.5)))}}$

In other words, for <27.5 21 19 23.6 17 17.9 16.9 20.1 21.9 22.6 23.1 19.6 19 21.7 21.4> and <27.1 22 20.8 23.4 23.4 23.5 25.8 22 24.8 20.2 21.9 22.1 22.9 20.5 24.4> the task description would have us calculate a value of 0.989311 but the implementations give a value of 0.021378. And you can easily see this in the code -

<lang c> double return_value = ((h / 6.0) * ((pow(x,a-1))/(sqrt(1-x)) + 4.0 * sum1 + 2.0 * sum2))/(expl(lgammal(a)+0.57236494292470009-lgammal(a+0.5)))</lang>

There is no "1 - expression" here (except deep inside the parentheses) and there is no division by 2 or multiplication by 0.5 (again, except deep inside the parentheses).

I think that either the task description needs to be changed to match the implementation, or the implementation needs to be changed to match the task description. --Rdm (talk) 07:47, 3 July 2015 (UTC)

Is there anything missing/in error that this page is still considered in draft mode?--Hailholyghost (talk) 13:40, 8 July 2015 (UTC)

Well... the lngamma algorithm will be important for anyone who doesn't have a native implementation of that. That could be a separate task, and linked in the task description, if you are not comfortable documenting it here. Paddy's suggestion of pseudocode is also worth considering (perhaps on a separate page such as Calculate_P-Value/Pseudocode linked into the task description?), though at this point there is perhaps enough real code that that is not such an issue? I guess, let's give some of the other people here time to weigh in on this... --Rdm (talk) 13:48, 8 July 2015 (UTC)

I saw that the C99 standard math.h has lgammal, so I figured it was standard in every language like log or pow. However, I see now that lgammal is not standard in every language. Is there something from math.h that is like <lang c>
#include <stdio.h>

long double lgammal (const long double input) {
  //... some math....
  return result;
}
</lang> that I could paste into my code?--Hailholyghost (talk) 18:22, 8 July 2015 (UTC)

 <lang c>
 #include <stdio.h>