User talk:WillNess

Hi WillNess, If reporting runtimes please restrict it to comparative times between two implementations using the same language and try and use loose comparative terms such as 'around twice' rather than actual times as anything other than that is very difficult to interpret w.r.t. run conditions, OS, ... And RC doesn't set out to be a site for comparative run-time evaluations.
Thanks, --Paddy3118 11:37, 5 September 2011 (UTC)

Hi, you made an edit to Hamming numbers with the comment "according to docs for tee() storage is NOT shared among its iterators". I would like to know what you mean. What does it mean that storage is not shared. --208.80.119.67 00:49, 9 September 2011 (UTC)

Documentation says it is "equivalent to":

<lang Python>def tee(iterable, n=2):

   it = iter(iterable)
   deques = [collections.deque() for i in range(n)]
   def gen(mydeque):
       while True:
           if not mydeque:             # when the local deque is empty
               newval = next(it)       # fetch a new value and
               for d in deques:        # load it to all the deques
                   d.append(newval)
           yield mydeque.popleft()
   return tuple(gen(d) for d in deques)

</lang>

I read it to show that each generator produced by tee() holds on to its own dequeue (list with pop() and append()). When a new item is taken from the original iterable, it is appended separately into each one of the dequeues. Were they all to share one dequeue amongst themselves, that would be a shared storage. popleft() wouldn't be always called, but only when the leftmost of internal iterators pointing into internal list would get advanced. WillNess 10:53, 9 September 2011 (UTC)

Good morning, Can you please undo the double-space style changes that you made on the Sieve thread-based solution? I find that kind of spacing much harder to read. Also, my intention in writing that code was to demonstrate Racket's ability to express solutions in different paradigms as well as show how such different solutions are really doing the same thing. For this reason I would like to keep unnecessary optimizations at a very minimal level. It's therefore questionable that the changes that you made (stepping by 2x, explicit 3 in the beginning) are worth it. Whatever speedup you get as a result of changing the argument order is definitely not worth it for this goal. So please avoid such optimizations -- not turning this website into a festival of obscure optimizations like the language shootout thing is one of its most appealing properties. Thanks, --Elibarzilay (talk) 03:58, 6 November 2015 (UTC)

Hi, Eli. About the spacing. For me, it is much more readable that way, and since I'm much less proficient than you are (no emotional charge here, just plain fact) at reading and writing Scheme/Racket, I can only assume that it will also be much easier for a casual reader, unfamiliar with the language -- which I assume is the target audience of this site. As for the evens "optimization", I'll try to revert it. -- WillNess (talk) 13:00, 6 November 2015 (UTC)

I remember now how that happened. You had (ints-from 3 2) there, and I just went along with your 2, instead of correcting it to 1. WillNess (talk) 13:39, 6 November 2015 (UTC)

I don't think that that code is mine, at least not directly (the whole sieve thing went out of control with all kinds of arguments around it, which lead to a pile of stuff that goes against simple examples, and I got tired of fighting it at some point). Still, I'd like to see it stays at roughly the same level with no further obscuring optimizations... As for the spacing, being a newbie or not is really not a good factor (if only because many real newbies writing single parens on a line in an attempt to mimic C), instead, it should be as standard looking as possible, and double spaces is something that is not. --Elibarzilay (talk) 19:38, 9 November 2015 (UTC)

here's your edit with the (ints-from 3 2) throughout, and some additional spacing in the else clauses which I interpreted as arbitrary space to achieve code alignment for visual effect. I'll remove the double spacing an retain that additional space in the else clauses, shortly. -- WillNess (talk) 20:54, 9 November 2015 (UTC)

The edits being done by me don't say much, since code was shuffled a lot, and it's likely that I copied and/or translated things while doing that. But whether I copied it or I did it, the general thing is that optimizations that obscure the code should be avoided here... Thanks for the edits! --Elibarzilay (talk) 22:45, 10 November 2015 (UTC)

I'm not seeking to assign any blame obviously, I just didn't know (then) how to interpret your intent and took it to the other side, I guess. But what you're saying makes perfect sense. Thanks, -- WillNess (talk) 19:37, 11 November 2015 (UTC)

No problems -- I just wanted to clarify how things got to where they are, and even though I did many edits the result is not something I'm too happy with... The original --simple-- code that I had in a way that nicely demonstrates the similarities between laziness, threads, and generators is now stashed in the obscurely named "Sequence of primes by Trial Division" -- the sieve code is the results of many butcherings of that code, once there were complaints that the simple version is not really an "Eratosthenes Sieve" (I'm still not convinced, the rules sound like something that was retrofitted to the greek guy). --Elibarzilay (talk) 01:02, 18 November 2015 (UTC)

About the difference, maybe you'll find these old lazy-Lisp definitions with rem (pg 3), 1976 and with (+) (pg 2), 1987 interesting. -- WillNess (talk) 11:56, 19 November 2015 (UTC)

IP block fixed.

The reason you saw the IP block was because MediaWiki inadvertently blocked the IP address of the Cloudflare backend you were connecting in through. --Michael Mol (talk) 19:12, 3 January 2016 (UTC)

Thanks!! -- WillNess (talk) 14:47, 17 January 2016 (UTC)

Space complexity improvement -- Hamming numbers

Hi, glad you enjoyed my tweak.

There is an edge case I don't think your argument checking covers: what if n is 0 or negative?

I believe the way you apply the tweak works out the same at the end as the way I do. I add the empirical error correction factor (which quickly becomes about a constant factor of the log of output number for even reasonable ranges) times the original estimate of the log value directly to the desired 'n' number as in the nth Hamming number to calculate a new estimated value as the upper bound of the band. The lower bound is then just the same difference lower (or twice the difference lower than the upper bound). Using my method, I don't have to produce fractions for both the upper bound and the difference.

As to the space complexity, since the log estimate is basically n^(1/3), and the error band is proportional to the log, then the space complexity is as calculated and quite quickly approaches this limit.

Of course the space complexity makes very little difference in speed just as the sorting algorithm chosen makes very little difference as operations are limited to the size of the error band which is negligible as compared to the amount of time spent in the inner loops; for very large ranges there is a slight slowdown from ideal time complexity as the size of the error band gets larger than the various CPU caches.

Yes, you are right that 32-bit Haskell is quite slow at handling Word64's, which in my version is why I limit the inner loop i, j, and k values to Word32 (which under the covers Integers are calculated as when the ranges are small enough). The GHC Native Code Generator (NCG) has a known problem, still there in version 8.0.1, especially for 32-bit code, in efficiently allocating registers that is known to be at least twice as slow as 64-bit code for many algorithms, so it is actually amazing that you managed to make this run as fast as you did on IdeOne under 32-bit. It looks like the extra slow-down for 320bit code is computing the count for each value, which for Integers reverts to something close to 32 bit values when the ranges are small enough instead of always having to compute by two chunks.

I've always avoided using Int's where I can because they used to have a few tag bits that limited their range for 32 bit use even more, but it seems were new versions of GHC these tag bits are no longer used.

The main time expended is due to the inner loops, and even in 64-bit code GHC is still slower by a constant factor than many other languages, even lowly C#/F#, likely due to the time spent building the (lazy) error list instead of using arrays. I think that one could write code that would be as fast as the others, but one would have to define an instance of an STUArray to hold the unboxed tuples used (perhaps as their own packed type class instead of tuples) and the code would get messy for submission to RosettaCode. For my F# version I use non-lazy lists (at a constant factor cost in memory use for the tail links) very effectively, running almost as fast as a version using mutable arrays such as C#. The fastest versions in various languages fun about twice as fast as the Haskell version (64-bit code).

Yes, we have had the discussion about type signatures before re: Sieve of Eratosthenes; there is a reason they are recommended in the Haskell style guides for outer definitions.

You know me from previous discussions on this and other forums: I like maximum efficiency from any given algorithm including this one, and I think that with this tweak we are pretty well there as far as the algorithm goes. However, there are still some problems with your Haskell implementation as posted on the page when I run it on my actual Windows machine. First, it currently doesn't compile as it is missing some changes you intended to make in moving it from the IdeOne listing: missing 'ww' and changes between estval2 back to estval and so on. There is still a time slow down due to garbage collection: with both 64 and 32 bit code, your non-foldl' version spends over half the time garbage collecting on my machine due to the list comprehensions used and even without the garbage collection is still slow - my version can calculate the trillionth number in just over five seconds for 64/32 using Integer instead of Word64 whereas yours takes about three times as long plus the garbage collection. It looks like you should correct the RosettaCode version to the same foldl' version you use on IdeOne. Even with foldl' to prevent the effects of being non-strict, the overhead of using the list comprehensions still makes your version in 64-bit code about twice as slow as my version. My test program on IdeOne is at http://ideone.com/1Np5Ik and runs just a little bit faster than yours in 32-bit code with Integer but quite a bit faster under 64-bit code with Word64. --GordonBGood (talk) 08:10, 19 August 2016 (UTC)

Hi, thanks for the explanations, and for spotting that naming error. The code now runs as is, http://ideone.com/GXh4P0. It is about 30% slower than foldl' version, indeed. As for n<=0, it is invalid input. n is expected to be 1-based. Probably will have to edit this all in, yes. :) -- WillNess (talk) 13:18, 19 August 2016 (UTC)

I've now re-coded foldl' as loops too and got essentially same timings as your entry on Ideone (about 20% speedup; won't edit it in, to keep the clarity of the current version). But, there is still potential for improvement: the k-j enumeration is easily parallelizable: partition the `k`s such that k-j triangle is divided into 4 (8, ...) strips of roughly equal areas, and process each bundle in parallel. Another, really vague idea I once had, can be seen here, but it's really vague. :) -- WillNess (talk) 11:07, 20 August 2016 (UTC)

I think you'll find that the loops encoding is quite a bit more than a 20% to 30% speed-up when using 64-bit code and Word64 for the internal count computation - more like 200% (at least on my AMD Bulldozer which isn't very efficient at using CPU cache). The reasons for the extreme difference are several: 1) simulated 64 bit operations in 32-bit code are extremely inefficient in current GHC due to the extremely poor register allocations, which means there will be a large number of register spills and reloads to/from RAM or at least CPU cache, and 2) the inner loop consists of fairly simple computations but uses quite a large number of values, including j, jlmt, lb3, hi, w (or lo in my code), q, frac, r, and i (in both integer and float form), which if not optimized properly as to register use require a lot more registers than the x86 architecture provides (even worse in current GHC's view). When each of these are one 64-bit register, the inner loop runs very fast in just 15 CPU clock cycles or so for the loop encoding without the list comprehensions overhead, but this is multiplied when using memory and multiplied yet again when some of the operations need to use 64-bit precision (for counting) by simulating from 32-bit so that the list comprehensions overhead isn't as significant. Using the LLVM backend might help some here even for 32-bit code in more efficient register use, but of course we can't use that on IdeOne (AFAIK). Note that I haven't been able to test LLVM use either as under Windows LLVM doesn't work with 64-bit and I can't get it working with 32-bit for this program (promised to be fixed in GHC version 8.2)... --GordonBGood (talk) 01:20, 21 August 2016 (UTC)

Yes, of course parallelization of the outer loop is easily possible and would speed the code up by a constant factor, but is probably a little much for a RosettaCode posting. --GordonBGood (talk) 01:20, 21 August 2016 (UTC)

Interesting stuff, thanks. Of course I always expect the more abstract source code to be compiled into as efficient low-level code as possible; that was The High Premise of Equational Reasoning as I understood it. All this toiling with the minutiae of coding is off-putting for me. I'd rather prefer (sum *** concat) . unzip $ [.....] instead. :) :) -- WillNess (talk) 09:03, 21 August 2016 (UTC)

That's a beautiful expression of the code, but it has a horrible space leak such that it's "out of memory" for anything much above 10 billion and takes about 80% of the time on garbage collection; I think the problem is that the generated zip list is entirely unevaluated (including the "add to list" predicate check) until used by the "sum" and "sortBy" functions - a problem that the "foldl'" version didn't have as it was forced to be strict. I'm afraid that while the THPoER highly abstract form can sometimes reduce complex code into something simple such that it runs faster, in this case, the abstract code comes at an extremely high cost, due to it practically being expressed as non-strict lists with the construction of such lists taking many extra cycles in the innermost loop. There are also the practical matters of memory use and garbage collection... Sometimes we need to look at code from a low level perspective to see what it is really doing; I'm afraid that in this case the non-strictness of Haskell has bitten you on the a... ;) --GordonBGood (talk) 03:56, 22 August 2016 (UTC)

Yes, I saw that error for above 10--20B; I decided to not care about it here, and to go with the shorter code to express the algorithm clearly. The "out of memory" bit comes from Ideone and I'm not entirely sure why, when it's still reporting the exact same memory consumption for all the cases below 10B. Real space leaks are usually seen on Ideone as reported memory climbing up into the 100s MBs, not so here. Any case, it's a compiler issue. And, we have your version on the page now for the optimized code. :) -- WillNess (talk) 06:50, 22 August 2016 (UTC)

Just for completeness, the "out of memory" error isn't just IdeOne but also with Windows GHC version 8.0.1 for values approaching a trillion on my machine (which has a lot of memory; I haven't tried to determine the exact limit); the < 20 billion limit on IdeOne is due to the amount of memory assigned. I suspect it isn't flagged as a "real" space leak because there is not a fixed pointer to the unevaluated zipped list; there is just the unevaluated list comprehensions. When the "sum" operation is run, it forces the partial evaluation of every list element to the first part of the tuple but not the evaluation of the predicate on whether a given element gets including in the concatenated list or not; this last doesn't get evaluated until the determined values get sorted, at which point the unused values get dumped (and garbage collected). So, during the course of calculation to a trillion, something like 100's of Gigabytes would be consumed (my machine doesn't have that many and thus fails). Thus, during the execution of the program the memory use goes from zero to very high but is back to very low again by the time the program ends. Although the expression of the algorithm is beautiful, the memory use profile is not! And yes, my version shows what can be done, including that if IdeOne were 64-bit, 10 billion should take less than 150 milliseconds, not about a half second for the "foldl'" version, not about 1.5 seconds for this version, and that maximum memory residency (as determined by "+RTS -s") is almost zero not many many megabytes. --GordonBGood (talk) 08:28, 22 August 2016 (UTC)

Ah, great, thanks for that. Yes, I understand the space leak here, that's why I wrote those other versions back then; just thought maybe they fixed it by now with the newer GHC somehow. You're right, all this should be made clear in the text. -- WillNess (talk) 10:34, 22 August 2016 (UTC)

I've tried it locally, and it reports "234MB memory in use" for 1B and 1187MB for the 10B. I don't understand what is this 7.8MB figure that Ideone shows for both. I was misled by this. -- WillNess (talk) 10:45, 22 August 2016 (UTC)

BTW the really short version of it is just (Sum c,b) = mconcat [ ( Sum (i+1), .... Here it could be argued that it ought to be compiled efficiently, as the whole premise of Monoids is that associativity enables re-parenthesization (a:b:c:... = [a]++([b]++([c]++...)) = ([a]++[b])++([c]++...)), which is the basis for the efficiency of the strict left fold. Yet it is more than twice worse than the prod code, both in time and space.

BTW, I got 1B:0.45s-269MB and 10B:1.75s-1343MB, which suggests 1T:38s-28.8GB "total memory in use". The fold-based version indeed did good at 1T:11.2s-10MB. I use the signature Int -> (Double, (Int,Int,Int)) which produced the fastest code in my tests (simple -O2). -- WillNess (talk) 17:13, 22 August 2016 (UTC)

Looks good; it's still about three times slower and uses more memory than 1T:3.75s:3MB "total memory in use" (from local) as for my version on IdeOne but it is good as referents will have a choice. I went back to using Word64 as the fastest (just a slightly higher range than Int) even for 32-bit code as it is faster, since it then uses the same type for the internal count. it looks like for all the Monoid stuff to work as you would like to see it there will have to be all kinds of specializations that haven't been added - complex stuff, and doing "concat" efficiently has always been a headache. --GordonBGood (talk) 00:50, 23 August 2016 (UTC)

Factors of an integer

The current algorithms factors_co and factors_o seem to give the wrong result. For example, factors_co 120 is missing the number 12 in its result. --Helge (talk) 13:15, 26 December 2016 (UTC)

Good catch, thanks! Will fix. -- WillNess (talk) 21:06, 26 December 2016 (UTC)

Done! Thanks again for spotting this. An added bonus was, the code simplified a bit, too. -- WillNess (talk) 07:43, 27 December 2016 (UTC)