User talk:Hailholyghost: Difference between revisions

From Rosetta Code

Revision as of 19:55, 13 March 2018

Welch's t-test

Hello, you wrote: "Welch's t-test is only part of the calculation, it isn't the purpose of the page."

What's the purpose of the page, then?

By the way, "Given two lists of data, calculate the p-value used for Calculation of P-value." is totally meaningless. Which p-value? There are zillions of p-values, associated with zillions of statistical tests. Basically all you have to know is the probability distribution of some statistic, but there are infinitely many of them: you are not going to ask for every possible distribution, so you have to choose one. Here, the whole task in its current state is about Welch's t-test and how to compute the corresponding p-value, and yet you claim it's not the purpose of the page. Puzzling.

Eoraptor (talk) 11:37, 8 December 2017 (UTC)
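
For reference, the quantities under discussion can be written out as follows. This is the standard textbook form of Welch's test, not something stated on the task page; the symbols (sample means \bar{x}_i, sample variances s_i^2 with n_i - 1 in the denominator, sample sizes n_i) are notation chosen here for illustration.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

The p-value is then a tail probability of Student's t distribution with \nu degrees of freedom, which is where the special function (or the integration) comes in.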

Welch's t-test is easy; the reason I created this page is that I had no idea how to calculate the integration, which is the point of the page. This page is meant to calculate this the same way as R's "t.test(x,y,paired=FALSE)". If this page were about the t-test, as you say it is, the C code would have 1 line and would be completely trivial. The point of the page is to show how to do the non-trivial things: integration. I spent weeks figuring out how to do this, and your title change obfuscates my hard work.--Hailholyghost (talk) 12:51, 8 December 2017 (UTC)
Then this is a very badly designed task. There are already tasks about integration. Anyway, nobody would compute the p-value this way: there are good implementations of special functions, here the incomplete beta function, and usually they make use of specific properties of the functions, like series, continued fractions, optimal rational (Padé-like) or polynomial (Chebyshev equioscillation) approximations, etc. If you want the p-value, the "standard" way is to find a special function library. Integration should be a last resort for Rosetta Code, for languages that do not have readily available special function libraries. Common languages such as C, Fortran or Python all have something (I implemented the task in Python using numpy/scipy, and there is something in Fortran using IMSL and SLATEC, both well-known libraries), and statistical packages even have built-in functions for the whole test (see Stata and R here).
If you want to see how to use a generalist integration routine to compute the p-value, you are doing it wrong. If you want a general task on integration, you are doing it wrong too, and there is already a task about this.
You write that the C code is trivial: really? How? You will need at least the incomplete beta function from GSL (or IMSL, NAG or any library with it, even a Fortran-based library), and this will require more than one line (you may do more or less what is done here in Fortran or Python). Even getting a working GSL on Windows is not trivial at all. And what you call the non-trivial thing is really not about integration at all.
Eoraptor (talk) 13:13, 8 December 2017 (UTC)
For instance, the R implementation uses a port of TOMS 708 from the ACM. The original code is here. Notice that, as usual with special functions, there are calls to different methods according to the values of the arguments. There are reasons not to reimplement this from scratch: it requires much research work to prove a given algorithm is correct, to find the necessary coefficients with enough precision, and to get a correct and efficient implementation. Here you are not even trying to investigate the convergence of the integral, and it's an inefficient way to find an accurate answer. Eoraptor (talk) 13:30, 8 December 2017 (UTC)

"You will need at least the incomplete beta function" you are contradicting yourself. The Welch's t-test requires no integration. It took me a long time to implement this in portable GNU99 C, which wouldn't require library installations. This was a particularly daunting task, because I couldn't install the libraries- they were useless. I thought the world could benefit from my work. If you have some better way of doing this, fine, re-write the C code to implement the algorithms you mention.

Nonetheless, I don't see how to alter the title of the page. It appears that this page is stuck with a wrong title.

Oh, no integration, no incomplete beta function? That is going to be quite difficult. You need the incomplete beta function, which can be written as an integral. Either you integrate, or you apply clever techniques to compute the function without integration. But you have to do something. And what you ask for is not doable anyway (no "general p-value"). You don't seem to understand what is wrong with what you are trying to do. Good luck. I give up. Eoraptor (talk) 15:46, 8 December 2017 (UTC)
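
To make the link between the two positions explicit: for a t statistic t with \nu degrees of freedom, the two-sided p-value is a value of the regularized incomplete beta function, which is itself defined by an integral. The identity below is the standard one; the symbol names are chosen here for illustration.

```latex
p = I_{\nu/(\nu + t^2)}\!\left(\frac{\nu}{2}, \frac{1}{2}\right),
\qquad
I_x(a, b) = \frac{1}{B(a, b)} \int_0^x u^{a-1} (1 - u)^{b-1} \, du,
\qquad
B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}
```

So a general quadrature routine and a dedicated special-function routine compute the same number; they differ in accuracy and cost, not in what is being computed.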


Hi,

Suggested reading for your Python code: PEP 8 -- Style Guide for Python Code. Most notably, tabs are not recommended (4 spaces per indentation level is the usual choice), line length should be limited, there should be no spaces after an opening paren or before a closing paren, no space between a function name and the following paren... While it may look stupid to you, these rules are applied in virtually all published Python code, and this is an important part of the readability of this language. There is also a minor bug at the beginning (two tests on ARRAY1_SIZE, none on ARRAY2_SIZE), and "while (1):" would rather be written "while True:". There are other oddities (inconsistent spacing around operators at least). If you don't mind, I could rewrite this in a more "Pythonic" way.
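
As an illustration of the points above, here is a small sketch in PEP 8 style. It is not the code from the task page: the names `check_sizes` and `count_halvings` are hypothetical, chosen only to show a size check applied to both arrays and a `while True:` loop.

```python
def check_sizes(array1, array2, minimum=2):
    """Raise ValueError unless BOTH samples hold at least `minimum` values."""
    # Looping over both arrays avoids testing the first one twice
    # and forgetting the second one.
    for name, values in (("array1", array1), ("array2", array2)):
        if len(values) < minimum:
            raise ValueError(
                f"{name} needs at least {minimum} values, got {len(values)}"
            )


def count_halvings(value, threshold=1.0):
    """Count the halvings needed to bring a positive `value` below `threshold`.

    `threshold` is assumed to be positive, otherwise the loop never ends.
    """
    steps = 0
    while True:  # preferred over "while (1):"
        if value < threshold:
            return steps
        value /= 2.0
        steps += 1
```

Four-space indentation, no space between a function name and its paren, and consistent spacing around operators are the visible differences from the style criticized above.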

As an aside, it's an often overlooked matter (and Burkardt was guilty as well, as far as I was able to check): translating from one language to another requires more than just converting syntax. It's a criticism that was also made of the famous (or infamous) Numerical Recipes: the C version looked too much like the Fortran original, and not like usual C.

I also moved the numpy solution first: not that I am eager to put "my" solution first, but: 1/ The numpy solution is much shorter, so the second solution is more visible if it's the low-level implementation. 2/ The low-level implementation is not what would actually be done in Python, for several reasons: speed (use compiled code when it's available), but most importantly because statistical computations in Python are usually done with the numpy/scipy/pandas framework, which is much closer to what's available in, say, R, than "pure Python", which would require reinventing everything.

HTH

Eoraptor (talk) 13:08, 10 March 2018 (UTC)

Hi Eoraptor, thanks for your suggestions. I've tried to re-edit my code as much as I could. I am new, and am learning from old code. I don't know how I can make the code idiomatic. I make sure that the output is correct before I post the code. Please feel free to edit what I put, I won't be insulted.--Hailholyghost (talk) 20:01, 12 March 2018 (UTC)

Done. A few notes: I use Python 3 (the only visible difference here is I write print("something"), not print "something"). I put the betain in a separate function: easier to debug, easier to reuse. While the naming convention is debatable, I feel it's more readable to have short names in long formulas. The main welch_ttest function returns what is usually printed when doing that kind of test: statistic, degrees of freedom, p-value. I have limited the output to one example, but I'll add the other ones if you wish.
I translated the betain function from Stata, as it's very little work, but there is still a problem with the license. I asked the RSS, but got no response. My problem is: the Applied Statistics routines were written by various authors, however the RSS claims to hold copyright on all the code. Why not, but there is no mention of the exact license (except that code should be freely redistributed). The GPL is much more explicit, and I don't think Burkardt had the right to redistribute a translation of AS 63 under GPL (at least, I have no evidence showing he was entitled to do so). It's a well-known problem with ancient numerical codes. By ancient, I mean 70s and 80s: code was often distributed with no specific notion of license, nothing was ever explicit. And years later software engineering was more "formalized", and the status of old code was not very clear. It's one of the reasons the GSL was created: a kind of modern SLATEC, with explicit GPL license. And because old code licenses were so fuzzy, the GSL guys decided not to reuse the "golden oldies" like SLATEC, CMLIB, NSWC, MATH77... those large generalist numerical libraries from the 70s, often written by US government personnel or at least with US government funding (so there should be no problem, however the copyright holders are usually not known, and there may be surprises). From time to time other free software projects consider using them, but the answer is always the same: too unclear, too risky.
Long story short: don't use these codes in software you plan to redistribute.
Hope this helps,
Eoraptor (talk) 21:54, 12 March 2018 (UTC)
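
For readers who want to see the structure described in this message (an incomplete beta helper kept separate, and a `welch_ttest` function returning statistic, degrees of freedom and p-value), here is a minimal pure-Python sketch. It is an assumption-laden illustration, not the code posted on the task page: in particular, `betain` here is NOT a translation of AS 63 or of the Stata routine, but a textbook continued-fraction expansion evaluated with the modified Lentz method, and the argument order `betain(x, a, b)` is chosen for this sketch only.

```python
import math
from statistics import mean, variance


def betain(x, a, b, eps=1e-14, max_iter=1000):
    """Regularized incomplete beta function I_x(a, b), for a, b > 0."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # The continued fraction converges fast for x < (a + 1) / (a + b + 2);
    # beyond that, use the symmetry I_x(a, b) = 1 - I_{1-x}(b, a).
    if x > (a + 1.0) / (a + b + 2.0):
        return 1.0 - betain(1.0 - x, b, a, eps, max_iter)
    # Prefactor x^a (1 - x)^b / (a B(a, b)), computed in log space.
    front = math.exp(a * math.log(x) + b * math.log1p(-x)
                     + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)) / a
    # Modified Lentz evaluation of f = 1 + d1 / (1 + d2 / (1 + ...)).
    tiny = 1e-300
    f, c, d = 1.0, 1.0, 0.0
    for k in range(1, max_iter + 1):
        m = k // 2
        if k % 2:  # odd index, k = 2m + 1
            num = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        else:      # even index, k = 2m
            num = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        d = 1.0 + num * d
        if abs(d) < tiny:
            d = tiny
        d = 1.0 / d
        c = 1.0 + num / c
        if abs(c) < tiny:
            c = tiny
        delta = c * d
        f *= delta
        if abs(delta - 1.0) < eps:
            return front / f
    raise ArithmeticError("continued fraction did not converge")


def welch_ttest(x, y):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two values")
    # Squared standard errors of the two means (sample variances use n - 1).
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p = betain(df / (df + t * t), df / 2.0, 0.5)
    return t, df, p


if __name__ == "__main__":
    print(welch_ttest([3.0, 4.0, 1.0, 2.1], [490.2, 340.0, 433.9]))
```

The sketch fails with a division by zero when both samples have zero variance, and it makes no claim to the accuracy of a vetted library routine such as the TOMS 708 port used by R.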

P-value correction

Hi,

In the task there are a number of p-values. Would it be possible to reproduce the original datasets? Here is the reason: these adjustments are used for multiple comparisons, and there may be routines in statistical software to do that, including the adjustment, but starting from actual data, not p-values. While I understand that the task goal is to mimic R, it's a bit silly to have to rewrite everything if there is already basically the same functionality.
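
For context, this is roughly what adjusting a list of p-values directly looks like. The sketch below is a hypothetical pure-Python helper, not the task's reference code; it covers only three of the methods offered by R's `p.adjust` (Bonferroni, Holm, Benjamini-Hochberg), and the function name `p_adjust` and its lowercase/"BH" method labels are choices made here.

```python
def p_adjust(pvalues, method="holm"):
    """Adjust p-values for multiple comparisons; result keeps input order."""
    n = len(pvalues)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    if method == "bonferroni":
        return [min(1.0, n * p) for p in pvalues]
    if method == "holm":
        # Step-down: multiply the k-th smallest p-value by (n - k + 1),
        # then enforce monotonicity with a running maximum.
        running = 0.0
        for rank, i in enumerate(order):
            running = max(running, (n - rank) * pvalues[i])
            adjusted[i] = min(1.0, running)
        return adjusted
    if method == "BH":
        # Step-up: scale the k-th smallest p-value by n / k,
        # then enforce monotonicity with a running minimum from the top.
        running = 1.0
        for rank in range(n - 1, -1, -1):
            i = order[rank]
            running = min(running, pvalues[i] * n / (rank + 1))
            adjusted[i] = running
        return adjusted
    raise ValueError(f"unknown method: {method!r}")
```

A routine like this needs only the p-values, which is why it can be tested without the original datasets; procedures that start from raw data, as discussed here, perform the tests and the adjustment in one step.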

I didn't look at SAS documentation yet, but in Stata there is the dunntest package from Alexis Dinno, which does exactly that. See the help (in Stata's markup language, but it should be readable).

Eoraptor (talk) 15:54, 13 March 2018 (UTC)

Actually, SAS can do both with PROC MULTTEST: usually it will be used with a dataset, but it's possible to pass p-values as well. Eoraptor (talk) 15:56, 13 March 2018 (UTC)

Hi Eoraptor, unfortunately the p-values were generated by a random number generator in R, probably `rnorm` or something similar (I did this several months ago, I don't remember exactly).--Hailholyghost (talk) 19:55, 13 March 2018 (UTC)