timings on RC
Hi, I note that in one of your examples you sprinkle timings. On RC we tend not to get too involved in mentioning runtimes unless something is exceptionally slow. It is best to restrict timings to comparative timings between two algorithms in the same language, and only where necessary: "This new algorithm cut the run time from X hours to only Y minutes" kind of thing. Actual run times will be hard for anyone else to duplicate, but the scaling in run times might be reproduced. The RC emphasis is on idiomatic code rather than the absolute fastest code. Most examples will not take hours to run in most programming languages, and most examples don't need timings.
Your Extensible Prime Generator Pascal contribution
You have obviously gone to great lengths to produce a fast version of this using Free Pascal, which caused me to look up what it is; I found it interesting that it is a reincarnation and modernization of Turbo Pascal/Delphi (most of the current contributors seem to have been no more than about ten years old when Turbo Pascal first appeared!). Although I used Turbo Pascal commercially in about 1983 and later learned Delphi, that didn't make me want to abandon my current stable of more modern languages to take Pascal up again. But your taste seems to differ.
So as to your Sieve of Eratosthenes implementation:
- if I were you, I would first contribute a version based on a hash map or a deferred-execution stream before opening up with the heavy guns; perhaps something like the Dart or Kotlin versions would suffice. This would satisfy those just looking for something to help them solve Project Euler problems.
- There isn't a lot of point in making the generator much faster than it is now, as the time to enumerate the found primes will be the main bottleneck: your version has to make two function calls per enumerated prime, to the `NextPrime` and `PosOfPrime` functions, and along with some processing in each, that will take about 100 CPU cycles per call, or roughly a second for the primes to a billion and about a hundred times that for the primes to a hundred billion, even if the actual sieving work takes zero time. You could reduce the number of function calls per prime found by maintaining a local counter variable of the primes found, which might speed enumeration up somewhat.
- Thus, you can't ever compare your generator against something like Kim Walisch's primesieve, which in its fastest mode doesn't enumerate at all but uses a popcount just to sum the bits in the bit-packed culled arrays.
- As to your general culling speed: you evidently know about using a bit-packed array (or maybe Pascal has a built-in BitArray type, I don't remember), and you use the optimization of filling from an array pre-culled by the small prime values to save about 70% of the work, but you still have a long way to go with improved culling and unrolling techniques (a factor of more than two), maximal 2/3/5/7 wheel factorization (a further factor of three), multi-threading, and many other techniques as the ranges get larger, including double culling, first at CPU L1 cache size and then a succession of those segments up to the CPU L2 cache size, and then a "bucket sieve" once that plays out, etc.
- I had a quick look at your code to see whether you were setting the buffer size correctly, and noticed a reference to "64K"; most CPUs have a maximum L1 data cache of 32 Kilobytes, and you shouldn't count on even that if you are considering multi-threading on a hyper-threaded machine, as each thread can then only use half. One can really only count on 16 KB, which is why the other techniques are required.
- At the top of the faster version you say "there is something wrong". What does that mean? Does the code have errors or inconsistencies? If so, you should state what the problem is.
- Ignoring the limitations of the "generator", it is possible to match or even beat "primesieve" using the techniques I mention above. I recently saw a post about a "ktprime" algorithm in C on GitHub that appears to do just that, but it is about 2500 obscure lines of code and you may not want to wade through what it does. I am currently working on a version that does the same and probably won't be much more than about 500 lines of easy-to-read Nim code, or even fewer lines of harder-to-read Haskell code (if you aren't a functional programmer). If you are interested, let me know and I'll post a link here when one or the other is complete.
- I originally started some of this work in Google's Go language but stopped working with it: other than being easy to learn, with a really nice-to-use co-routine (goroutine) system, I find it feels quite "old" to use, and it really doesn't produce all that fast code since they decided to write their own compiler. I've learned that the fastest languages are those that are either just front ends to C, as Nim is, or that use LLVM, as Haskell can and as Rust and Julia do; otherwise they are just re-inventing what has already been done so well and for so many years by the others. I don't know whether Free Pascal uses its own code generator or how it compares in raw speed with the others, and as I say I don't care enough to install it and try to find out. But you likely do ;)
I chose Haskell because I like it and wanted to show that it can produce code as fast as C/C++, and chose Nim because it has a modern, concise syntax yet is really just a sanitized pre-processor for C and thus is fairly easy to read for most imperative programmers; also, Nim needs a little more love compared to better-known new languages such as Rust.
BTW, a good way to make speeds more comparable in a CPU-agnostic way is to post them as CPU clock cycles per cull operation and CPU clock cycles per prime found; that removes much of the variation between comparisons made by people running your examples, although for some algorithms on some CPUs there can be variation of up to a factor of two due to the abilities of the CPU, such as simultaneous micro-instruction execution (not including hyper-threading or multi-threading).--GordonBGood (talk) 08:06, 6 January 2019 (UTC)
- I think that it may be possible to write a Free Pascal version of the Sieve of Eratosthenes that catches up to Kim Walisch's primesieve with the current fpc compiler:
- I run a very simple Sieve of Eratosthenes benchmark that tests raw, maximally optimized loop speed (other than the extreme loop unrolling used in primesieve); this implementation runs over only a 16 Kilobyte buffer of packed bits representing odds-only numbers, as every modern CPU will have an L1 cache at least that large; thus, it sieves the primes up to 262146, with the answer 23000 (including the only even prime, two).
- On Wandbox with full optimization, a Free Pascal version using the HEAD version 3.3.1 compiler takes only about 156 milliseconds, whereas a C version using Clang HEAD version 8.0.0 takes about 147 milliseconds; the difference is probably just "line noise", but it may also be that Free Pascal still doesn't fully tune the ordering of what are likely the same machine instructions. The Free Pascal compiler has been greatly improved since the last release, as it takes about 250 milliseconds with the last version 3.0.2: I expect that it now combines the bit modification "k^ := k^ or msk;" into a single read-modify-write instruction rather than the former three: one instruction to read from memory, another to modify, and another to write the result back to memory. This last is what Google's Go language still does, which is one of the reasons it is fairly slow on this benchmark.
- A maximally optimized Sieve of Eratosthenes must use bit packing to make maximum use of the CPU caches; when optimized as here, the cost of the extra operations needed for bit packing is more than made up for by the improved memory access times of the L1 cache. As Pascal has pointers, as used here, and with the newly optimized compiler, it should be able to take advantage of the extreme loop unrolling used by primesieve to gain a little more, especially for the smaller base primes.
- It's been a "blast from the past" for me to re-visit Pascal after all these years in order to write this benchmark, but it has also made me realize how far language design has moved on since the Turbo Pascal days when we used to rave about it: compared to a modern language like Kotlin or Nim, or to modern functional languages like Haskell or even more so F#, there is just no going back for me, as it feels "clunky": the verbosity of declaring vars separately from the code using them, no type inference whatsoever, begin/end blocks, etc., etc. Even Julia is better; although I don't like its dynamic typing, there are ways to get around that with some effort.--GordonBGood (talk) 04:43, 8 January 2019 (UTC)
signing text added in discussions
When adding (your) text to topics on discussion pages, please sign (appending it at the end of your additions) with:
- -- ~~~~
That is, two minus signs (or whatever), with four tildes (~). The latter adds your ID (name) and the UTC date and time. (There are other formats you can use.)
Brazilian numbers speed up
Thanks for your hints on speeding up the calculations of Brazilian numbers.