Legendre prime counting function

The prime-counting function π(n) computes the number of primes not greater than n. Legendre was the first mathematician to create a formula to compute π(n) based on the inclusion/exclusion principle.

To calculate:

Define

φ(x, 0) = x
φ(x, a) = φ(x, a−1) − φ(⌊x/p_a⌋, a−1), where p_a is the a^th prime number.

then

π(n) = 0 when n < 2
π(n) = φ(n, a) + a - 1, where a = π(√n), n ≥ 2

The Legendre formula still requires the use of a sieve to enumerate primes; however it's only required to sieve up to the √n, and for counting primes, the Legendre method is generally much faster than sieving up to n.

Task

Calculate π(n) for values up to 1 billion. Show π(n) for n = 1, 10, 100, ... 10⁹.

For this task, you may refer to a prime number sieve (such as the Sieve of Eratosthenes or the extensible sieve) in an external library to enumerate the primes required by the formula. Also note that it will be necessary to memoize the results of φ(x, a) in order to have reasonable performance, since the recurrence relation would otherwise take exponential time.

Comments on Task

Regarding "it will be necessary to memoize the results of φ(x, a)", it will have exponential time performance without memoization only if a very small optimization that should be obvious is not done: it should be obvious that one can stop "splitting" the φ(x, a) "tree" when 'x' is zero, and even before that since the real meaning of the "phi"/φ function is to produce a count of all of the values greater than zero (including one) up to `x` that have been culled of all multiples of the primes up to and including the `p_a` prime value, if `x` is less than or equal to `p_a`, then the whole "tree" must result in a value of just one. If this minor (and obvious) optimization is done, the "exponential time" performance goes away, memoization is not absolutely necessary (saving the overhead in time and space of doing the memoization), and the time complexity becomes O(n/(log n²) and the space complexity becomes O(n^1/2/log n) as they should be.

This is the problem when non-mathematician programmers blindly apply such a general formula as the recursive Legendre one without doing any work to understand it or even doing some trivial hand calculations to better understand how it works just because they have very powerful computers which mask the limitations: a few minutes of hand calculation would make it obvious that there is no need to "split"/recursively call for "phi" nodes where the first argument is zero, and someone with a mathematics interest would then investigate to see if that limit can be pushed a little further as here to the range of nodes whose result will always be one. Once there is no exponential growth of the number of "nodes", then there is no need for memoization as usually implemented with a hash table at a huge cost of memory overhead and constant time computation per operation.

As to "the Legendre method is generally much faster than sieving up to n.", while the number of operations for the Legendre algorithm is about a factor of `log n` squared less than the number of operations for odds-only Sieve of Eratosthenes (SoE) sieving, those operations are "divide" operations which are generally much slower than the simple array access and addition operations used in sieving and the SoE can be optimized further using wheel factorization so that the required time for a given range can be less for a fully optimized SoE than for the common implementation of the Legendre algorithm; the trade off is that a fully optimized SoE is at least 500 lines of code, whereas the basic version of the Legendre prime counting algorithm is only about 40 to 50 lines of code (depending somewhat on the language used).

Also note that the Legendre prime counting function was never used practically at the time it was invented other than to demonstrate that it would find the count of primes to a trivial range only knowing the primes up to the square root of that range and there were too many operations (especially long integer division operations) to actually use it for any reasonably range even with this optimization (about 250 thousand divisions to count primes to ten million), but the follow-on work by Meissel in the 1800's definitely would have used this optimization and others in order to hand calculate the number of primes to a billion (1e9) in about ten years. Even with this optimization, Meissel would have had to hand calculate over five million divisions, so certainly used other Look Up Tables (LUT's) although certainly not caching of Phi/φ values in order to reduce the work to something possible in this amount of time. A "TinyPhi" LUT table for the first six primes of thirteen and less would have reduced the amount of work Meissel did to about 600 thousand divisions, but even that would have been perhaps too much and it is very likely that he also used "partial sieving" techniques, although that would have meant that as well as a table of the primes up to a million, he would have also needed 161 other tables of that range to a million sieved by the primes up to 13, 17, 19, to 997; however, that extra work in building these tables (which might have been done mechanically) would pay off in reducing the number of divisions to about seven thousand so the divisions become a minor problem possible to do over months and the majority of the time would be spent producing the partial sieving tables up to a million.

The reason that Meissel refined the Legendre method would have been that, even applying all of the optimizations including "partial sieving", he would still have had to do about three and a half million divisions to count the primes to a billion even if the number of primes and "partial sieve tables" only needed to be known to about 32 thousand, where his "Meissel" algorithm reduced the number of divisions to only a few thousand as per the above. Without a computer, he could never have completed the calculation of the number of primes to a billion using an optimized Legendre algorithm where he could using his modification. However, modern computers make (reasonably) quick work of integer divisions so that optimized algorithms of the Legendre type become moderately useful although at the cost of memory use as compared to Meissel type algorithms.

C++

#include <cmath>
#include <iostream>
#include <vector>

std::vector<int> generate_primes(int limit) {
    std::vector<bool> sieve(limit >> 1, true);
    for (int p = 3, s = 9; s < limit; p += 2) {
        if (sieve[p >> 1]) {
            for (int q = s; q < limit; q += p << 1)
                sieve[q >> 1] = false;
        }
        s += (p + 1) << 2;
    }
    std::vector<int> primes;
    if (limit > 2)
        primes.push_back(2);
    for (int i = 1; i < sieve.size(); ++i) {
        if (sieve[i])
            primes.push_back((i << 1) + 1);
    }
    return primes;
}

class legendre_prime_counter {
public:
    explicit legendre_prime_counter(int limit);
    int prime_count(int n);
private:
    int phi(int x, int a);
    std::vector<int> primes;
};

legendre_prime_counter::legendre_prime_counter(int limit) :
    primes(generate_primes(static_cast<int>(std::sqrt(limit)))) {}

int legendre_prime_counter::prime_count(int n) {
    if (n < 2)
        return 0;
    int a = prime_count(static_cast<int>(std::sqrt(n)));
    return phi(n, a) + a - 1;
}

int legendre_prime_counter::phi(int x, int a) {
    if (a == 0)
        return x;
    if (a == 1)
        return x - (x >> 1);
    int pa = primes[a - 1];
    if (x <= pa)
        return 1;
    return phi(x, a - 1) - phi(x / pa, a - 1);
}

int main() {
    legendre_prime_counter counter(1000000000);
    for (int i = 0, n = 1; i < 10; ++i, n *= 10)
        std::cout << "10^" << i << "\t" << counter.prime_count(n) << '\n';
}

Output:

10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

CoffeeScript

sorenson = require('sieve').primes  # Sorenson's extensible sieve from task: Extensible Prime Generator

# put in outer scope to avoid recomputing the cache
memoPhi = {}
primes = []

isqrt = (x) -> Math.floor Math.sqrt x

pi = (n) ->
    phi = (x, a) ->
        y = memoPhi[[x,a]]
        return y unless y is undefined

        memoPhi[[x,a]] =
            if a is 0 then x
            else
                p = primes[a - 1]
                throw "You need to generate at least #{a} primes." if p is undefined
                phi(x, a - 1) - phi(x // p, a - 1)

    if n < 2
        0
    else
        a = pi isqrt n
        phi(n, a) + a - 1

maxPi = 1e9
gen = sorenson()
primes = while (p = gen.next().value) < isqrt maxPi then p

n = 1
for i in [0..9]
    console.log "10^#{i}\t#{pi(n)}"
    n *= 10

Output:

10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

Crystal

There doesn't seem to be much point to contributing memoized (Hash table) versions as they seem to be the result of the task author not realizing that there is a simple way of keeping the computation complexity from having exponential growth with counting range by limiting the "splitting" of the "phi" calculation when the result must be just one. Accordingly, the following Crystal versions are [translated from the non-memoizing Nim versions](https://rosettacode.org/wiki/Legendre_prime_counting_function#Non-Memoized_Versions), which explanations can be referenced for an understanding of how they work:

The Basic Recursive Legendre Algorithm

require "bit_array"

def count_primes(n : Int64)
  if n < 3_i64
    return 0_i64 if n < 2_i64
    return 1_i64
  end
  rtlmt = Math.sqrt(n.to_f64).to_i32
  mxndx = (rtlmt - 3) // 2
  cmpsts = BitArray.new(mxndx + 1)
  i = 0
  while true
    c = (i + i) * (i + 3) + 3
    break if c > mxndx
    unless cmpsts[i]
      bp = i + i + 3
      until c > mxndx
        cmpsts[c] = true
        c += bp
      end
    end
    i += 1
  end
  oprms = Array(Int32).new(cmpsts.count { |e| !e }, 0)
  pi = 0
  cmpsts.each_with_index do |e, i|
    unless e
      oprms[pi] = (i + i + 3).to_i32; pi += 1
    end
  end
  phi = uninitialized Proc(Int64, Int32, Int64) # recursion target!
  phi = ->(x : Int64, a : Int32) {
    return x - (x >> 1) if a < 1
    p = oprms.unsafe_fetch(a - 1)
    return 1_i64 if x <= p
    phi.call(x, a - 1) - phi.call((x.to_f64 / p.to_f64).to_i64, a - 1)
  }
  phi.call(n, oprms.size) + oprms.size
end

start_time = Time.monotonic
(0 .. 9).each { |i| puts "π(10**#{i}) = #{count_primes(10_i64**i)}" }
elpsd = (Time.monotonic - start_time).total_milliseconds

puts "This took #{elpsd} milliseconds."

Output:

π(10**0) = 0
π(10**1) = 4
π(10**2) = 25
π(10**3) = 168
π(10**4) = 1229
π(10**5) = 9592
π(10**6) = 78498
π(10**7) = 664579
π(10**8) = 5761455
π(10**9) = 50847534
This took 272.428954 milliseconds.

The time is as run on an Intel i5-6500 (3.6 GHz single-threaded boost).

Although this is about ten times faster than if it were using memoization (and uses much less memory), it is only reasonably usable for trivial ranges such as this (trivial in terms of prime counting functions).

The Less Recursive "Bottom-Up" with a TinyPhi LUT Algorithm

Just substitute the following Crystal code for the `count_primes` function above to enjoy the gain in speed:

Tiny_Phi_Primes = [ 2, 3, 5, 7, 11, 13 ]
Tiny_Phi_Odd_Circ = Tiny_Phi_Primes.product // 2
Tiny_Phi_Tot = Tiny_Phi_Primes.reduce(1) { |acc, p| acc * (p - 1) }
CC = Tiny_Phi_Primes.size - 1
def make_Tiny_Phi_LUT()
  rslt = Array(UInt16).new(Tiny_Phi_Odd_Circ, 1_u16)
  Tiny_Phi_Primes.skip(1).each { |bp|
      i = (bp - 1) >> 1; rslt[i] = 0; c = (i + i) * (i + 1)
      while c < Tiny_Phi_Odd_Circ
        rslt[c] = 0; c += bp
      end }
  acc = 0_u16; i = 0
  while i < Tiny_Phi_Odd_Circ
    acc += rslt[i]; rslt[i] = acc; i += 1
  end
  rslt
end
Tiny_Phi_LUT = make_Tiny_Phi_LUT()
@[AlwaysInline]
def tiny_Phi(x : Int64) : Int64
  ndx = (x - 1) >> 1; numtot = ndx // Tiny_Phi_Odd_Circ.to_i64
  tpli = ndx - numtot * Tiny_Phi_Odd_Circ.to_i64
  numtot * Tiny_Phi_Tot.to_i64 +
    Tiny_Phi_LUT.unsafe_fetch(tpli).to_i64
end

def count_primes(n : Int64)
  if n < 169_i64 # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
    return 0_i64 if n < 2_i64
    return 1_i64 if n < 3_i64
    # adjust for the missing "degree" base primes
    return 1 + (n - 1) // 2 if n < 9_i64
    return (n - 1) // 2 if n <= 13_i64
    return 5 + Tiny_Phi_LUT[(n - 1).to_i32 // 2].to_i64
  end
  rtlmt = Math.sqrt(n.to_f64).to_i32
  mxndx = (rtlmt - 3) // 2
  cmpsts = BitArray.new(mxndx + 1)
  i = 0
  while true
    c = (i + i) * (i + 3) + 3
    break if c > mxndx
    unless cmpsts[i]
      bp = i + i + 3
      until c > mxndx
        cmpsts[c] = true
        c += bp
      end
    end
    i += 1
  end
  oprms = Array(Int32).new(cmpsts.count { |e| !e }, 0)
  opi = 0
  cmpsts.each_with_index do |e, i|
    unless e
      oprms[opi] = (i + i + 3).to_i32; opi += 1
    end
  end
  lvl = uninitialized Proc(Int32, Int32, Int64, Int64) # recursion target!
  lvl = ->(pilo : Int32, pilmt : Int32, m : Int64) : Int64 {
    pi = pilo; answr = 0_i64
    while pi < pilmt
      p = oprms.unsafe_fetch(pi).to_i64; nm = p * m
      return answr + (pilmt - pi) if n <= nm * p
      q = (n.to_f64 / nm.to_f64).to_i64; answr += tiny_Phi(q)
      answr -= lvl.call(CC, pi, nm) if pi > CC
      pi += 1
    end
    answr
  }
  tiny_Phi(n) - lvl.call(CC, oprms.size, 1_i64) + oprms.size
end

Use of the above function gets a gain in speed of about a further ten times over the above version due to less recursion, the use of the "Tiny_Phi" Look Up Table (LUT) for smaller "degrees" of primes which greatly reduces the number of integer divisions, and by using floating point division for integer division because it is faster, although it is limited in precision to 53-bits, one wouldn't want to use these algorithms over a range where that would cause a problem. This version of prime counting function might be considered to be reasonably useful for counting primes up to a trillion (1e12) in a few tens of seconds.

The Non-Recursive Partial Sieving Algorithm

Just substitute the following Crystal code for the `count_primes` function above to enjoy the further gain in speed:

def count_primes(n : Int64) : Int64
  if n < 3
    if n < 2
      return 0_i64
    else
      return 1_i64
    end
  end
  half = ->(n : Int64) : Int64 { (n - 1) >> 1 }
  divide = ->(n : Int64, d : Int64) : Int64 { (n.to_f64 / d.to_f64).to_i64 }
  rtlmt = Math.sqrt(n.to_f64).to_i32; mxndx = (rtlmt - 1) // 2
  cmpsts = BitArray.new(mxndx + 1)
  smalls = Array(Int32).new(mxndx + 1) { |i| i }
  roughs = Array(Int32).new(mxndx + 1) { |i| i + i + 1 }
  larges =
    Array(Int64).new(mxndx + 1) { |i| ((n // (i + i + 1)).to_i64 - 1) >> 1 }
  i = 1; nbps = 0; mxri = mxndx
  while true
    c = (i + i) * (i + 1); break if c > mxndx
    if !cmpsts.unsafe_fetch(i)
      bp = i + i + 1; cmpsts.unsafe_put(i, true)
      until c > mxndx
        cmpsts.unsafe_put(c, true); c += bp
      end # partial sieving for bp completed here!

      j = 0; ri = 0 # adjust `larges` according to partial sieve...
      while j <= mxri
        q = roughs.unsafe_fetch(j); qi = q >> 1
        if !cmpsts.unsafe_fetch(qi)
          d = bp.to_i64 * q.to_i64
          larges.unsafe_put(ri, larges.unsafe_fetch(j) -
                                  if d <= rtlmt.to_i64
                                    ndx = smalls.unsafe_fetch(d >> 1) - nbps
                                    larges.unsafe_fetch(ndx)
                                  else
                                    ndx = half.call(divide.call(n, d))
                                    smalls.unsafe_fetch(ndx)
                                  end + nbps)
          roughs.unsafe_put(ri, q); ri += 1
        end; j += 1
      end

      si = mxndx; bpm = (rtlmt // bp - 1) | 1
      while bpm >= bp # adjust smalls according to partial sieve...
        c = smalls.unsafe_fetch(bpm >> 1) - nbps; e = (bpm * bp) >> 1
        while si >= e
          smalls.unsafe_put(si, smalls.unsafe_fetch(si) - c); si -= 1
        end
        bpm -= 2
      end

      mxri = ri - 1; nbps += 1
    end; i += 1
  end

  ans = larges.unsafe_fetch(0); i = 1
  while i <= mxri # combine results; adjust for over subtraction base primes...
    ans -= larges.unsafe_fetch(i); i += 1
  end
  ans += (mxri.to_i64 + 1 + 2 * (nbps.to_i64 - 1)) * mxri.to_i64 // 2 # adjust!

  ri = 1 # do final phi calculation for pairs of larger primes...
  while true # break on condition when up to cube root of range!
    p = roughs.unsafe_fetch(ri).to_i64; q = n // p
    e = smalls.unsafe_fetch(half.call(divide.call(q, p))) - nbps
    break if e <= ri; ori = ri + 1
    while ori <= e
      ndx = half.call(divide.call(q, roughs.unsafe_fetch(ori).to_i64))
      ans += smalls.unsafe_fetch(ndx).to_i64; ori += 1
    end
    ans -= (e - ri).to_i64 * (nbps.to_i64 + ri.to_i64 - 1); ri += 1
  end

  ans + 1 # for only even prime of two!
end

The above function enjoys quite a large gain in speed of about a further ten times over the immediately preceding version due to not using recursion at all and the greatly reduced computational complexity of O(n**(3/4)/((log n)**2)) instead of the almost-linear-for-large-counting-ranges O(n/((log n)**2)) for the previous two versions, although the previous two versions differ in performance by a constant factor due to the overheads of recursion and division. This version of prime counting function might be considered to be reasonably useful for counting primes up to a 1e16 in almost two minutes as long as the computer used has the required free memory of about 800 Megabytes. This Crystal version is about twice as slow as the Nim version from which it was translated because Nim's default GCC back-end is better at optimization (at least for this algorithm) than the LLVM back-end used by Crystal and also there is more numeric calculation checking such as numeric overflow used in Crystal that can't be turned off. At that, this version is about the same speed as when translated to Rust, which also uses a LLVM back-end.

Erlang

-mode(native).

-define(LIMIT, 1000000000).

main(_) ->
    put(primes, array:from_list(primality:sieve(floor(math:sqrt(?LIMIT))))),
    ets:new(memphi, [set, named_table, protected]),
    output(0, 9).

nthprime(N) -> array:get(N - 1, get(primes)).

output(A, B) -> output(A, B, 1).
output(A, B, _) when A > B -> ok;
output(A, B, N) ->
    io:format("10^~b    ~b~n", [A, pi(N)]),
    output(A + 1, B, N * 10).

pi(N) ->
    Primes = get(primes),
    Last = array:get(array:size(Primes) - 1, Primes),
    if
        N =< Last -> small_pi(N);
        true ->
            A = pi(floor(math:sqrt(N))),
            phi(N, A) + A - 1
    end.

phi(X, 0) -> X;
phi(X, A) ->
    case ets:lookup(memphi, {X, A}) of
        [] ->
            Phi = phi(X, A-1) - phi(X div nthprime(A), A-1),
            ets:insert(memphi, {{X, A}, Phi}),
            Phi;

        [{{X, A}, Phi}] -> Phi
    end.

% Use binary search to count primes that we have already listed.
small_pi(N) ->
    Primes = get(primes),
    small_pi(N, Primes, 0, array:size(Primes)).

small_pi(_, _, L, H) when L >= (H - 1) -> L + 1;
small_pi(N, Primes, L, H) ->
    M = (L + H) div 2,
    P = array:get(M, Primes),
    if
        N > P -> small_pi(N, Primes, M, H);
        N < P -> small_pi(N, Primes, 0, M);
        true -> M + 1
    end.

Output:

10^0    1
10^1    4
10^2    25
10^3    168
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534

Go

Translation of: Wren

Library: Go-rcu

Memoized

This runs in about 1.9 seconds which is surprisingly slow for Go. However, Go has the same problem as Wren in not being able to use lists for map keys. I tried using a 2-element array for the key but this was a second slower than the Cantor function.

The alternative of sieving a billion numbers and then counting them takes about 4.8 seconds using the present library function though this could be cut to 1.2 seconds using a third party library such as primegen.

package main

import (
    "fmt"
    "log"
    "math"
    "rcu"
)

func cantorPair(x, y int) int {
    if x < 0 || y < 0 {
        log.Fatal("Arguments must be non-negative integers.")
    }
    return (x*x + 3*x + 2*x*y + y + y*y) / 2
}

func pi(n int) int {
    if n < 2 {
        return 0
    }
    if n == 2 {
        return 1
    }
    primes := rcu.Primes(int(math.Sqrt(float64(n))))
    a := len(primes)
    memoPhi := make(map[int]int)

    var phi func(x, a int) int // recursive closure
    phi = func(x, a int) int {
        if a < 1 {
            return x
        }
        if a == 1 {
            return x - (x >> 1)
        }
        pa := primes[a-1]
        if x <= pa {
            return 1
        }
        key := cantorPair(x, a)
        if v, ok := memoPhi[key]; ok {
            return v
        }
        memoPhi[key] = phi(x, a-1) - phi(x/pa, a-1)
        return memoPhi[key]
    }

    return phi(n, a) + a - 1
}

func main() {
    for i, n := 0, 1; i <= 9; i, n = i+1, n*10 {
        fmt.Printf("10^%d  %d\n", i, pi(n))
    }
}

Output:

10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534

Non-memoized

Much quicker than before (0.53 seconds) and, of course, uses less memory.

package main

import (
    "fmt"
    "math"
    "rcu"
)

func pi(n int) int {
    if n < 2 {
        return 0
    }
    if n == 2 {
        return 1
    }
    primes := rcu.Primes(int(math.Sqrt(float64(n))))
    a := len(primes)

    var phi func(x, a int) int // recursive closure
    phi = func(x, a int) int {
        if a < 1 {
            return x
        }
        if a == 1 {
            return x - (x >> 1)
        }
        pa := primes[a-1]
        if x <= pa {
            return 1
        }
        return phi(x, a-1) - phi(x/pa, a-1)
    }

    return phi(n, a) + a - 1
}

func main() {
    for i, n := 0, 1; i <= 9; i, n = i+1, n*10 {
        fmt.Printf("10^%d  %d\n", i, pi(n))
    }
}

Output:

Same as first version.

Java

import java.util.*;

public class LegendrePrimeCounter {
    public static void main(String[] args) {
        LegendrePrimeCounter counter = new LegendrePrimeCounter(1000000000);
        for (int i = 0, n = 1; i < 10; ++i, n *= 10)
            System.out.printf("10^%d\t%d\n", i, counter.primeCount((n)));
    }

    private List<Integer> primes;

    public LegendrePrimeCounter(int limit) {
        primes = generatePrimes((int)Math.sqrt((double)limit));
    }

    public int primeCount(int n) {
        if (n < 2)
            return 0;
        int a = primeCount((int)Math.sqrt((double)n));
        return phi(n, a) + a - 1;
    }

    private int phi(int x, int a) {
        if (a == 0)
            return x;
        if (a == 1)
            return x - (x >> 1);
        int pa = primes.get(a - 1);
        if (x <= pa)
            return 1;
        return phi(x, a - 1) - phi(x / pa, a - 1);
    }

    private static List<Integer> generatePrimes(int limit) {
        boolean[] sieve = new boolean[limit >> 1];
        Arrays.fill(sieve, true);
        for (int p = 3, s = 9; s < limit; p += 2) {
            if (sieve[p >> 1]) {
                for (int q = s; q < limit; q += p << 1)
                    sieve[q >> 1] = false;
            }
            s += (p + 1) << 2;
        }
        List<Integer> primes = new ArrayList<>();
        if (limit > 2)
            primes.add(2);
        for (int i = 1; i < sieve.length; ++i) {
            if (sieve[i])
                primes.add((i << 1) + 1);
        } 
        return primes;
    }
}

Output:

10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

Haskell

Memoization utilities:

data Memo a = Node a (Memo a) (Memo a)
  deriving Functor

memo :: Integral a => Memo p -> a -> p
memo (Node a l r) n
  | n == 0 = a
  | odd n = memo l (n `div` 2)
  | otherwise = memo r (n `div` 2 - 1)

nats :: Integral a => Memo a
nats = Node 0 ((+1).(*2) <$> nats) ((*2).(+1) <$> nats)

memoize :: Integral a => (a -> b) -> a -> b
memoize f = memo (f <$> nats)

memoize2 :: (Integral a, Integral b) => (a -> b -> c) -> a -> b -> c
memoize2 f = memoize (memoize . f)

memoList :: [b] -> Integer -> b
memoList = memo . mkList
  where
    mkList (x:xs) = Node x (mkList l) (mkList r)
      where (l,r) = split xs
            split [] = ([],[])
            split [x] = ([x],[])
            split (x:y:xs) = let (l,r) = split xs in (x:l, y:r)

Computation of Legendre function:

isqrt :: Integer -> Integer
isqrt n = go n 0 (q `shiftR` 2)
 where
   q = head $ dropWhile (< n) $ iterate (`shiftL` 2) 1
   go z r 0 = r
   go z r q = let t = z - r - q
              in if t >= 0
                 then go t (r `shiftR` 1 + q) (q `shiftR` 2)
                 else go z (r `shiftR` 1) (q `shiftR` 2)

phi = memoize2 phiM
  where
    phiM x 0 = x
    phiM x a = phi x (a-1) - phi (x `div` p a) (a - 1)
    
    p = memoList (undefined : primes)

legendrePi :: Integer -> Integer
legendrePi n
  | n < 2 = 0
  | otherwise = phi n a + a - 1
    where a = legendrePi (floor (sqrt (fromInteger n)))

main = mapM_ (\n -> putStrLn $ show n ++ "\t" ++ show (legendrePi (10^n))) [1..7]

J

This is "almost a primitive" in J:

   require'format/printf'
   {{echo '10^%d: %d' sprintf y;p:inv 10^y}}"0 i.10
10^0: 0
10^1: 4
10^2: 25
10^3: 168
10^4: 1229
10^5: 9592
10^6: 78498
10^7: 664579
10^8: 5761455
10^9: 50847534

A complete implementation of the legendre prime counting function would be (1&p: + p:inv):

   (1&p: + p:inv) 10
4
   (1&p: + p:inv) 11
5

But we can simplify that to p:inv when we know that primes are not being tested.

jq

Adapted from Wren

Works with: jq

Works with gojq, the Go implementation of jq (*)

The following implementation freely uses the following conveniences of jq syntax:

the jq expression {$x} is an abbreviation for {"x" : $x}
the jq expression {x} is an abbreviation for {"x" : .x}

The key-value pairs can be similarly specified in JSON objects with multiple keys.

(*) gojq struggles to get beyond [5,9592] without using huge amounts of memory; jq becomes excessively slow after [7,664579].

Preliminaries

# For the sake of the accuracy of integer arithmetic when using gojq:
def power($b): . as $in | reduce range(0;$b) as $i (1; . * $in);

# Input: $n, which is assumed to be positive integer,
# Output: an array of primes less than or equal to $n (e.g. 10|eratosthenes #=> [2,3,5,7]
def eratosthenes:
  # erase(i) sets .[i*j] to false for integral j > 1
  def erase(i):
    if .[i] then 
      reduce range(2; (1 + length) / i) as $j (.; .[i * $j] = false)
    else .
    end;

  (. + 1) as $n
  | (($n|sqrt) / 2) as $s
  | [null, null, range(2; $n)]
  | reduce (2, 1 + (2 * range(1; $s))) as $i (.; erase($i))
  | map(select(.)) ;

Phi and the Legendre Counting Function

def legendre:
  (sqrt | floor + 1 | eratosthenes) as $primes

  # Input: {x, a}
  # Output: {phi: phi(x,a), memo} where .memo might have been updated
  | def phi:
    . as {x: $x, a: $a, memo: $memo}
    | if $a == 0 then {phi: $x, $memo}
      else "\($x),\($a)" as $ix
      | $memo[ $ix ] as $m
      | if $m then {phi: $m, $memo}
        else .a += -1
        | phi as {phi: $phi1, memo: $memo}
        | (($x / $primes[$a - 1])|floor) as $x
        | ({$x, a, $memo} | phi) as {phi: $phi2, memo: $memo}
        | ($phi1 - $phi2) as $phi
        | {$phi, $memo}
	| .memo[$ix] = $phi
        end
      end;

  def l:
    . as $n
    | if . < 2 then 0
      else ($n|sqrt|floor|l) as $a
      | ({x: $n, $a, memo: {}} | phi).phi + $a - 1
      end;

  l;


def task:
  range(0;10)
  | . as $i
  | [., (10|power($i)|legendre)]
  ;

task

Output:

[0,0]
[1,4]
[2,25]
[3,168]
[4,1229]
[5,9592]
[6,78498]
[7,664579]
<terminated>

Julia

Using the Julia Primes library

This would be faster with a custom sieve that only counted instead of making a bitset, instead of the more versatile library function.

using Primes

function primepi(N)
    delta = round(Int, N^0.8)
    return sum(i -> count(primesmask(i, min(i + delta - 1, N))), 1:delta:N)
end

@time for power in 0:9
    println("10^", rpad(power, 5), primepi(10^power))
end

Output:

10^0    0
10^1    4
10^2    25
10^3    168
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534
  1.935556 seconds (1.64 k allocations: 415.526 MiB, 2.15% gc time)

Using the method given in the task

Translation of: CoffeeScript

Run twice to show the benefits of precomputing the library prime functions.

using Primes

const maxpi = 1_000_000_000
const memoφ = Dict{Vector{Int}, Int}()
const pₐ = primes(isqrt(maxpi))

function π(n)
    function φ(x, a)
        if !haskey(memoφ, [x, a])
            memoφ[[x, a]] = a == 0 ? x : φ(x, a - 1) - φ(x ÷ pₐ[a], a - 1)
        end
        return memoφ[[x, a]]
    end

    n < 2 && return 0
    a = π(isqrt(n))
    return φ(n, a) + a - 1
end

@time for i in 0:9
    println("10^", rpad(i, 5), π(10^i))
end

@time for i in 0:9
    println("10^", rpad(i, 5), π(10^i))
end

Output:

10^0    1
10^1    4   
10^2    25  
10^3    168 
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534
 12.939211 seconds (69.44 M allocations: 3.860 GiB, 27.98% gc time, 1.58% compilation time)
10^0    1
10^1    4
10^2    25
10^3    168
10^4    1229
10^5    9592
10^6    78498
10^7    664579
10^8    5761455
10^9    50847534
  0.003234 seconds (421 allocations: 14.547 KiB)

Mathematica/Wolfram Language

ClearAll[Phi,pi]
$RecursionLimit = 10^6;
Phi[x_, 0] := x
Phi[x_, a_] := Phi[x, a] = Phi[x, a - 1] - Phi[Floor[x/Prime[a]], a - 1]
pi[n_] := Module[{a}, If[n < 2, 0, a = pi[Floor[Sqrt[n]]]; Phi[n, a] + a - 1]]
Scan[Print[pi[10^#]] &, Range[0,9]]

Output:

Nim

The program runs in 2.2 seconds on our laptop. Using int32 instead of naturals (64 bits on our 64 bits computer) saves 0.3 second. Using an integer rather than a tuple as key (for instance x shl 32 or a) saves 0.4 second.

import math, strutils, sugar, tables

const
  N = 1_000_000_000
  S = sqrt(N.toFloat).int

var composite: array[3..S, bool]
for n in countup(3, S, 2):
  if n * n > S: break
  if not composite[n]:
    for k in countup(n * n, S, 2 * n):
      composite[k] = true

# Prime list. Add a dummy zero to start at index 1.
let primes = @[0, 2] & collect(newSeq, for n in countup(3, S, 2): (if not composite[n]: n))

var cache: Table[(Natural, Natural), Natural]

proc phi(x, a: Natural): Natural =
  if a == 0: return x
  let pair = (x, a)
  if pair in cache: return cache[pair]
  result = phi(x, a - 1) - phi(x div primes[a], a - 1)
  cache[pair] = result

proc π(n: Natural): Natural =
  if n <= 2: return 0
  let a = π(sqrt(n.toFloat).Natural)
  result = phi(n, a) + a - 1

var n = 1
for i in 0..9:
  echo "π(10^$1) = $2".format(i, π(n))
  n *= 10

Output:

π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534

Non-Memoized Versions

The above code takes a huge amount of memory to run, especially if using the full size "Naturals" which not using them will severely limit the usable counting range (which of course is limited anyway by the huge memory use by the memoization). As per the comments on the task, memoization is not strictly needed, as the exponential growth in execution time can be limited by a very simple optimization that stops splitting of the recursive "tree" when there are no more contributions to be had by further "leaves" to the right of the "tree", as per the following code:

# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

proc countPrimes(n: int64): int64 =
  if n < 3:
    return if n < 2: 0 else: 1
  else:
    let rtlmt = n.float64.sqrt.int
    let mxndx = (rtlmt - 1) div 2
    let sz = (mxndx + 8) div 8
    var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0(sz))
    for i in 1 .. mxndx:
      if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
      let sqri = (i + i) * (i + 1)
      if sqri > mxndx: break
      let bp = i + i + 1
      for c in countup(sqri, mxndx, bp):
        let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
    var pisqrt = 0'i64
    for i in 0 .. mxndx:
      if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
    var primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
    var j = 0
    for i in 0 .. mxndx:
      if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
    proc phi(x: int64; a: int): int64 =
      if a <= 1:
        return if a < 1: x else: x - (x shr 1)
      let p = primes[a - 1].int64
      if x <= p: return 1 # very simple one-line optimization that limits exponential growth!
      return phi(x, a - 1) - phi((x.float64 / p.float64).int64, a - 1)
    result = phi(n, pisqrt.int) + pisqrt - 1
    cmpsts.dealloc; primes.dealloc

let nstrt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let nelpsd = (getMonoTime() - nstrt).inMilliseconds
echo "This took ", nelpsd, " milliseconds."

Output:

π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534
This took 194 milliseconds.

As shown in the output, the above program runs without memoization about 12 times faster and using a huge amount less memory than the previous memoized algorithm. Other than not requiring memoization, the implementation also substitutes floating point division for integer division so as to be faster for many CPU architectures (although only being of 53-bits accuracy instead of 64-bits; which limited precision will limit the range over which the primes can be accurately counted, but as this algorithm is only reasonably useful to about ten to the fourteenth power anyway, that shouldn't be a problem.

Using Look Up Table for "TinyPhi" and Decreasing Recursion

A further optimization that greatly reduces the number of operations and thus the run time for these small ranges is to use a pre-computed "TinyPhi" Look Up Table (LUT) array for the "leaves" where the "degree" is small (such as degree of six for the smallest primes used as 13 as used here, which then forms a repeating wheel pattern of 30030 values); this would reduce the "width" of the tree at its widest "base" to eliminate many divisions and recursions and thus reduce the execution time greatly, although the effect is more for smaller ranges such as this as for greater ranges. As well, one can compute the "tree" starting at the lower left, which then reduces some of the recursions to a simple loop and can eliminate the recursions used for succeeding "ones" according to the optimization eliminating the need for more recursion. This is because "Phi"/φ can be calculated by the given "top down" recursion expression given in the task which is derived from the "bottom up" expression as follows:

\lfloor x\rfloor -\sum _{i}\left\lfloor {\frac {x}{p_{i}}}\right\rfloor +\sum _{i<j}\left\lfloor {\frac {x}{p_{i}p_{j}}}\right\rfloor -\sum _{i<j<k}\left\lfloor {\frac {x}{p_{i}p_{j}p_{k}}}\right\rfloor +\cdots

(where $\lfloor {x}\rfloor$ denotes the floor function).

The following Nim code implements these changes:

# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

const TinyPhiPrimes = [2, 3, 5, 7, 11, 13]
const TinyPhiCirc = 3 * 5 * 7 * 11 * 13
const TinyPhiRes = 2 * 4 * 6 * 10 * 12
const CC = 6
proc makeTinyPhiLUT(): array[TinyPhiCirc, uint16] =
  for i in 0 .. TinyPhiCirc - 1: result[i] = 1
  for i in 1 .. 6:
    if result[i] == 0: continue
    result[i] = 0; let bp = i + i + 1
    let sqri = (i + i) * (i + 1)
    for c in countup(sqri, TinyPhiCirc - 1, bp): result[c] = 0
  var acc = 0'u16;
  for i in 0 .. TinyPhiCirc - 1: acc += result[i]; result[i] = acc
const TinyPhiLUT = makeTinyPhiLUT()
proc tinyPhi(x: int64): int64 {.inline.} =
  let ndx = (x - 1) div 2; let numtot = ndx div TinyPhiCirc.int64
  return numtot * TinyPhiRes.int64 + TinyPhiLUT[(ndx - numtot * TinyPhiCirc.int64).int].int64

proc countPrimes(n: int64): int64 =
  if n < 169: # below 169 whose sqrt is 13 is where TinyPhi doesn't work...
    if n < 3: return if n < 2: 0 else: 1
    # adjust for the missing "degree" base primes
    if n <= 13: return (n - 1) div 2 + (if n < 9: 1 else: 0)
    return 5 + TinyPhiLUT[(n - 1).int div 2].int64
  let rtlmt = n.float64.sqrt.int
  let mxndx = (rtlmt - 1) div 2
  var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))
  for i in 1 .. mxndx:
    if (cmpsts[i shr 3] and masksp[i and 7]) != 0: continue
    let sqri = (i + i) * (i + 1)
    if sqri > mxndx: break
    let bp = i + i + 1
    for c in countup(sqri, mxndx, bp):
      let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
  var pisqrt = 0'i64
  for i in 0 .. mxndx:
    if (cmpsts[i shr 3] and masksp[i and 7]) == 0: pisqrt += 1
  let primes = cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * pisqrt.int))
  var j = 0
  for i in 0 .. mxndx:
    if (cmpsts[i shr 3] and masksp[i and 7]) == 0: primes[j] = (i + i + 1).uint32; j += 1
  var phi = tinyPhi(n)
  proc lvl(m, mbf: int64; mxa: int) = # recurse from bottom left of "tree"...
    for a in CC .. mxa:
      let p = primes[a].int64
      if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!
      let nm = (m.float64 / p.float64).int64; phi += mbf * tinyPhi(nm)
      if a > CC: lvl(nm, -mbf, a - 1) # split
    # finished level!
  lvl(n, -1, pisqrt.int - 1); result = phi + pisqrt - 1
  cmpsts.dealloc; primes.dealloc  

let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."

The results of the above code are identical to the previous version except it is almost 20 times faster due to the reduced recursion; however, the gain due to the "TinyPhi" LUT is most effective for relatively trivial counting ranges such as this and will have less and less benefit as ranges get larger, although the reduced recursion due to reordering is a constant factor gain across the range.

Note that the optimization to eliminate the need for memoization has been changed to "if m < p * p: phi += mbf * (mxa - a + 1).int64; return # rest of level all ones!", which has exactly the same effect as the previous optimization but stops the recursion for all "parent" branches of the same "level".

Larger "TinyPhi" LUT's can be used for some added benefit (especially for these smaller ranges) although they quickly get quite huge where the benefit is reduced due to lesser cache associativity; for instance the LUT for a "TinyPhi" of degree eight can a wheel circumference of 9699690 values which can be reduced by a factor of two by only considering odd values and another factor of two using the fact that the LUT is symmetric approaching the middle from the beginning and end of the LUT.

Reducing Operations by using Partial Sieving

There has been some work adapting the Legendre theory to the use of partial sieving as is known in the competitive programming communities, although there doesn't seem to be a link to a mathematical paper deriving the basis of that algorithm. The result of that work is a code algorithm that still uses memory proportional to the square root of the range to be counted but greatly reduces the computational complexity from O(n / (log n)^2) to O(n^(3/4) / (log n)^2). Since a mathematical paper supporting the algorithm has not been found, a brief explanation is provided here (if someone knows of such a paper, please edit in a link):

It turns out that there is another way to think of the "Phi"/φ(m, a)/inclusion-exclusion expression from which the task expression and the above implementations have been derived - that expresses the count of values left after the range to m has been partially sieved/partially culled by the primes up to the "ath" one. Now, Legendre would likely not have considered thinking of the inclusion/exclusion principle that way as he made no real practical use of his work other than some trivial examples showing that it worked and could calculate the number of primes up to a range only knowing the primes up to the square root of that range, as one has to be able to see that this can be used repeatedly to adjust Look Up Tables (LUT's) of partial counts with each partial sieve and, of course, there were no computers in his time (1800's).

The LUT's used in the following code are as follows:

The simple "smalls" LUT is a table of the counts of remaining values after culling of base primes up to a "kth prime" value up to the square root of the counting range (may as well be odd values since two is the only even prime), which could be updated by repeated scanning of the sieve buffer but is more efficiently updated by decrementing the values between each cull value by the constant difference. To make this easier, this "smalls" LUT includes the count of the base primes used up to the current state and doesn't include the one that is normally included in a "Phi" count.
Another important table is the "roughs" LUT where the "k-roughs" term wasn't used until post 2000 or so, but is the remaining (may as well be odd) values in a range (in this case to the square root of the counting range) after elimination of all multiples of primes up to the kth one. This table will start as half (because odds-only) the square root of the counting range in length but will be reduced in effective length by each partial sieving pass by each base prime value so that only prime values above the last base prime value remain (plus the first value of one) so will be approximately sqrt n / log n in length.
The final "larges" LUT is the most important one and perhaps the hardest to grasp as it contains the partial counts of values up to the entire counting range for each of the current "roughs" values so thus has an effective length that is synchronized and thus reduced with the above "roughs" LUT. It is initialized with the number of odd un-culled values up to and including each current "k-rough" value, and like the "smalls" LUT, its values are adjusted for each partial sieve pass. The updated values of this "larges" LUT cannot be determined only by reference to the remaining count after culling but the offset to be subtracted must be determined by look up of the previous values in this same LUT of all prime products smaller than the square root of the counting range.
These three LUT's change until partial sieving is complete, which since they represent values up to the square root of the counting range, happens after all culling by base prime values up to the quad root of the counting range.
As the Sieve of Eratosthenes sieving is only to the range of the square root of the counting range, it is a negligible part of the total execution time and hasn't been much optimized.

In order to understand the updating of the "larges" LUT as per the above, one needs to understand the relationship between the φ recursion as per the task description and partial sieving as follows:

The number of (odd) values remaining after a partial sieve and the "phi" expression are closely related as they only differ by the number of base primes and one.
As per the previous implementation, the "tree" is being adjusted and built from the bottom up, so the initial state of the LUT's represents the bottom of the tree just after partial sieving by the only even base prime of two, and every succeeding partial sieve pass need to build the immediate "parent" node counts based on the values available in the "child" nodes as per the current state of the LUT's.
The relationship between the "parents" and the "children" for a given count limit is given by the relationship "φ(x, a) = φ(x, a−1) − φ(⌊x/pa⌋, a−1)", which in words says that the "larges" LUT entry for a given count for a "parent" level adjustment will have a new value of the old value minus the looked up count of that value divided by the current partial sieve base prime cull value for the partial sieving pass.
There are two different means of obtaining the adjustment to be subtracted from the old "larges" count value: for products of the two values less than or equal to the square root of the counting table, the index of the product is looked up in the "smalls" LUT and that index is then used to find the adjustment in the "larges" LUT; for products larger than that limit, the quotient will be less than that limit so can be converted to an index and directly looked up in the "smalls" LUT.
Since the "larges" LUT needs to be treated in the same way as to offsets and scaling as the "smalls" LUT and the latter includes the count of current base primes used and doesn't include the one, the "larges" LUT needs to do the same. When adjusting for a partial sieving pass by taking a difference, the offset of the current base primes count is cancelled out and needs to be added back in after the difference.
The synced reduction in effective length of the "roughs" and "larges" LUT's due to culled values for each partial sieving pass is effected in the same loop that does the partial sieving.
During the partial sieving passes loop, it is impossible for the "non-branch splitting" optimization that makes memoization unnecessary to occur, so that check is not applied during this loop, thus saving a little execution time.
After partial sieving is complete (when the base primes reach the cube root of the counting range), the loop could continue without sieving to determine the rest of the "parent" values up to the top level with "a" equal to the last base prime up to the square root of the counting range, but that would be less efficient as to having to inject a check for the "non-splitting termination optimization", to determine that the base primes have reached a level where sieving isn't necessary, and a slightly less efficient compensation for the differences eliminating the base primes count and restoring on an operation-by-operation basis, so this loop is terminated and a new loop does this final phase.
This partial sieving and LUT building loop is the source of the major part of the execution complexity, which for very large ranges will tend toward a O(n^(3/4)/((log n)^2)) execution complexity as can be seen that we are using products of the base primes up the the cube root of the counting range and what will tend toward the number of primes up to the square root of the counting range.

Since we no longer need to use partial sieving and adjustment of the LUT's, at this stage the result of the current state of computation can be combined into a variable. As the zeroth index of "larges" LUT contains the count of the value one "rough" (also the zeroth value in the LUT), it is the result of only one base prime at a time and is therefore included and thus additive where all the other effective values in the "larges" LUT are the result of the product of two co-primes and therefore excluded and subtracted from the zeroth value. This will result in an over subtraction of the number of base primes up to the quad root of the counting range, so that will need to be added back in by the number of effective indexes greater than zero in the "larges" LUT minus one just as this offset is compensated as the values were adjusted and will have to be further adjusted as we add further inclusions to the final result.

At this stage, an answer has been obtained but it is not quite correct as to the inclusions due to the base primes/lowest prime factors between the quad root of the counting range and the square root of the counting range have not yet been added, as follows:

Again, these are pairs of (this time prime as partial sieving has been completed) factors with the base primes as described above.
It is easy to demonstrate why there cannot be more than two prime factors used since the smallest prime factor used will be just over the quad root of the counting range, and if there were three successive prime factors used, the quotient would be less than the quad root of the counting range which would be a breaking/non-splitting condition and this is the best case for three prime factors so three prime factors cannot be used in this loop
This second non-sieving loop will never reach the top end of the range for two large successive prime factors since when the prime factors start at about the cube root of the counting range, they will hit the terminating non-splitting condition and the loop will terminate. This termination is equivalent in effect to the non-splitting optimization so that "tree" splitting is not necessary but since the count LUT's (and the result accumulator variable) don't contain the one offset, no further compensation of adding a computed number of ones needs to be made.
Since there are always two prime factors as per the above, their compensation will always be positive as in included/added; this is due to that effectively they are being double subtracted as in once as an equivalent compensation to the "larges" values and once for the final transfer of the "larges" values to the result accumulator.
The product of the pairs of prime factors will always be greater than the square root of the counting range so the quotient will always be less than the square root of the counting range and only the "smalls" LUT needs to be used for these compensations.
As the compensations for each pair are each adding the count of base primes up to the quad root of the counting range to the accumulator, this needs to be compensated by subtracting a multiple of the number of compensations minus one times the number of base primes used.
Since this second loop terminates early, there are slightly less operations than in the first partial sieving loop as the base prime multiplier will never exceed about the cube root of the counting range so that the first loop controls the computational complexity more than this second loop.

Finally, we add one to the final answer to account for the only even prime of two.

The reason that this algorithm can reduce the number of operations as compared to the "standard" recursive Legendre algorithm in that the product of some combinations of multiple primes is compensated in one operation rather than many "branching"/recursive operations as in the recursive or semi-recursive algorithms.

One might question the validity of this "partial sieving" optimization considering the task explanation, but the Legendre prime counting function was likely never used practically at the time it was invented other than for trivial examples showing that it worked as mentioned above since there were no electronic computers; the follow-on work by Meissel was used by him to hand calculate the number of primes to a billion over the course of some number of years in about 1870, and he almost certainly used partial sieving as otherwise he would have done over nine million long divisions or over two million long divisions even with use of a "TinyPhi" Look Up Table of degree six, which would not be possible in his lifetime at something like a hundred long division calculations a day. D. H. Lehmer did not use partial sieving in his computer work on prime counting in 1959, but that was because he was focused on fitting a version of the Meissel algorithm into the memory limitations of the computers of that day (a few thousand words of RAM memory) such that he had to pre-calculate and store the primes values required on a digital magnetic tape that had to be played several times as the calculation proceeded and thus didn't fully investigate (or fully understand) all of the optimizations as used by Meissel, using the brute mathematical computational strength of the computer to overcome the lack of optimizations. So partial sieving as an optimization technique would have been known but not applied until Legarias, Miller, and Odlyzko (LMO) investigated a way to apply it to a computer algorithm. It could have been applied to both the Legendre and Meissel algorithms much sooner other than, at least for the Legendre algorithm, the need for memory proportional to the square root of the counting range would have limited the useful range to counting ranges of 1e12 or so when computers of the day had at most several Megabytes of RAM memory. One of the biggest advantages of LMO compared to even this optimization of Legendre is it greatly reduces memory requirements such that in 1985 it was possible to determine the count of primes to 4e16 on a mainframe computer with only a few megabytes of RAM memory, which took 1730 minutes even multi-threaded (two threads).

A Nim implementation of the above described algorithm is as follows:

# compile with: nim c -d:danger -t:-march=native --gc:arc

from std/monotimes import getMonoTime, `-`
from std/times import inMilliseconds
from std/math import sqrt

let masks = [ 1'u8, 2, 4, 8, 16, 32, 64, 128 ] # faster than bit twiddling
let masksp = cast[ptr[UncheckedArray[byte]]](unsafeAddr(masks[0]))

# non-recursive Legendre prime counting function for a range `n`...
# this has O(n^(3/4)/((log n)^2)) time complexity; O(n^(1/2)) space complexity.
proc countPrimes(n: int64): int64 =
  if n < 3: # can't odd sieve for value less than 3!
    return if n < 2: 0 else: 1
  else:
    proc half(n: int): int {.inline.} = (n - 1) shr 1 # convenience conversion to index
    # dividing using float64 is faster than int64 for some CPU's...
    # precision limits range to maybe 1e16!
    proc divide(nm, d: int64): int {.inline.} = (nm.float64 / d.float64).int
    let rtlmt = n.float64.sqrt.int # precision limits range to maybe 1e16!
    let mxndx = (rtlmt - 1) div 2
    var smalls = # current accumulated counts of odd primes 1 to sqrt range
      cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
    # initialized for no sieving whatsoever other than odds-only - partial sieved by 2:
    #   0 odd primes to 1; 1 odd prime to 3, etc....
    for i in 0 .. mxndx: smalls[i] = i.uint32
    var roughs = # current odd k-rough numbers up to sqrt of range; k = 2
      cast[ptr[UncheckedArray[uint32]]](alloc(sizeof(uint32) * (mxndx + 1)))
    # initialized to all odd positive numbers 1, 3, 5, ... sqrt range...
    for i in 0 .. mxndx: roughs[i] = (i + i + 1).uint32
    # array of current phi counts for above roughs...
    # these are not strictly `phi`'s since they also include the
    # count of base primes in order to match the above `smalls` definition!
    var larges = # starts as size of counts just as `roughs` so they align!
      cast[ptr[UncheckedArray[int64]]](alloc(sizeof(int64) * (mxndx + 1)))
    # initialized for current roughs after accounting for even prime of two...
    for i in 0 .. mxndx: larges[i] = ((n div (i + i + 1) - 1) div 2).int64
    # cmpsts is a bit-packed boolean array representing
    # odd composite numbers from 1 up to rtlmt used for sieving...
    # initialized as "zeros" meaning all odd positives are potentially prime
    # note that this array starts at (and keeps) 1 to match the algorithm even
    # though 1 is not a prime, as 1 is important in computation of phi...
    var cmpsts = cast[ptr[UncheckedArray[byte]]](alloc0((mxndx + 8) div 8))

    # number of found base primes and current highest used rough index...
    var npc = 0; var mxri = mxndx
    for i in 1 .. mxndx: # start at index for 3; i will never reach mxndx...
      let sqri = (i + i) * (i + 1) # computation of square index!
      if sqri > mxndx: break # stop partial sieving due to square index limit!
      if (cmpsts[i shr 3] and masksp[i and 7]) != 0'u8: continue # if not prime
      # culling the base prime from cmpsts means it will never be found again
      cmpsts[i shr 3] = cmpsts[i shr 3] or masksp[i and 7] # cull base prime
      let bp = i + i + 1 # base prime from index!
      for c in countup(sqri, mxndx, bp): # SoE culling of all bp multiples...
        let w = c shr 3; cmpsts[w] = cmpsts[w] or masksp[c and 7]
      # partial sieving to current base prime is now completed!

      var ri = 0 # to keep track of current used roughs index!
      for k in 0 .. mxri: # processing over current roughs size...
        # q is not necessarily a prime but may be a
        # product of primes not yet culled by partial sieving;
        # this is what saves operations compared to recursive Legendre:
        let q = roughs[k].int; let qi = q shr 1 # index of always odd q!
        # skip over values of `q` already culled in the last partial sieve:
        if (cmpsts[qi shr 3] and masksp[qi and 7]) != 0'u8: continue
        # since `q` cannot be equal to bp due to cull of bp and above skip;
        let d = bp * q # `d` is odd product of some combination of odd primes!
        # the following computation is essential to the algorithm's speed:
        # see above description in the text for how this works:
        larges[ri] = larges[k] -
                     (if d <= rtlmt: larges[smalls[d shr 1].int - npc]
                      else: smalls[half(divide(n, d.int64))].int64) + npc.int64
        # eliminate rough values that have been culled in partial sieve:
        # note that `larges` and `roughs` indices relate to each other!
        roughs[ri] = q.uint32; ri += 1 # update rough value; advance rough index

      var m = mxndx # adjust `smalls` counts for the newly culled odds...
      # this is faster than recounting over the `cmpsts` array for each loop...
      for k in countdown(((rtlmt div bp) - 1) or 1, bp, 2): # k always odd!
        # `c` is correction from current count to desired count...
        # `e` is end limit index no correction is necessary for current cull...
        let c = smalls[k shr 1] - npc.uint32; let e = (k * bp) shr 1
        while m >= e: smalls[m] -= c; m -= 1 # correct over range down to `e`
      mxri = ri - 1; npc += 1 # set next loop max roughs index; count base prime
    # now `smalls` is a LUT of odd prime accumulated counts for all odd primes;
    # `roughs` is exactly the "k-roughs" up to the sqrt of range with `k` the
    #    index of the next prime above the quad root of the range;
    # `larges` is the partial prime counts for each of the `roughs` values...
    # note that `larges` values include the count of the odd base primes!!!
    # `cmpsts` are never used again!

    # the following does the top most "phi tree" calculation:
    result = larges[0] # the answer to here is all valid `phis`
    for i in 1 .. mxri: result -= larges[i] # combined here by subtraction
    # compensate for the included odd base prime counts over subracted above:
    result += ((mxri + 1 + 2 * (npc - 1)) * mxri div 2).int64

    # This loop adds the counts due to the products of the `roughs` primes,
    # of which we only use two different ones at a time, as all the
    # combinations with lower primes than the cube root of the range have
    # already been computed and included with the previous major loop...
    # see text description above for how this works...
    for j in 1 .. mxri:  # for all `roughs` (now prime) not including one:
      let p = roughs[j].int64; let m = n div p # `m` is the `p` quotient
      # so that the end limit `e` can be calculated based on `n`/(`p`^2)
      let e = smalls[half((m div p).int)].int - npc
      # following break test equivalent to non-memoization/non-splitting optmization:
      if e <= j: break # stop at about `p` of cube root of range!
      for k in j + 1 .. e: # for all `roughs` greater than `p` to end limit:
         result += smalls[half(divide(m, roughs[k].int64))].int64
      # compensate for all the extra base prime counts just added!
      result -= ((e - j) * (npc + j - 1)).int64

    result += 1 # include the count for the only even prime of two
    smalls.dealloc; roughs.dealloc; larges.dealloc; cmpsts.dealloc

let strt = getMonoTime()
var pow = 1'i64
for i in 0 .. 9: echo "π(10^", i, ") = ", pow.countPrimes; pow *= 10
let elpsd = (getMonoTime() - strt).inMilliseconds
echo "This took ", elpsd, " milliseconds."

The output of the above code is the same as the previous version except that it is almost ten times faster still and the execution time is almost too small to be measured for this trivial range (for prime counting functions) of only a billion; it takes a little over a second to calculate the count of primes up to 1e13 and just over 30 seconds to calculate the number of primes up to 1e16 on a modern CPU.

The above program uses about 250,000 long integer divisions to calculate the count of primes to a billion as here, which is small enough to be able to calculate it by hand in several man-years of work. The Meissel-LMO algorithm with usual basic optimizations greatly reduces the number of integer long divisions (to only about 17 thousand with a "TinyPhi" LUT of degree eight) so as to be even more possible to hand calculate in only a few man-months, although still subject to math errors as Meissel made in this computation to a billion in the about 1870's. Meissel did not have our advantage in electronic computational power, but just using partial sieving and even a small degree of "TinyPhi" table would have made it easily possible to hand calculate in a few man-years.

The advantage of this "k-roughs" algorithm is its relative simplicity, but that comes at the cost of fairly high memory use due to using Legendre such that it is only really usable to about 1e16 even for a modern computer. Another disadvantage is that each partial sieving pass must be completed before being able to proceed with the next, which precludes using page segmentation and the ease of introducing multi-threading that provides. One could use page segmented sieving as for the the LMO algorithm but if one were to add that much added code complexity, one may as well implement full LMO and also enjoy the reduction in memory use. Once one has page-segmented sieving, one can then easily convert the code to multi-threading for gains in speed proportional to the number of effective threads for a given computer.

As counting range is increased above these trivial ranges, a better algorithm such as that of LMO will likely perform much better than this and use much less memory (to proportional to the cube root of the range), although at the cost of greater code complexity and more lines of code (about 500 LOC). The trade-off between algorithms of the Legendre type and those of the Meissel type is that the latter decreases the time complexity but at the cost of more time spent sieving; modern follow-on work to LMO produces algorithms which are able to tune the balance between Legendre-type and Meissel-type algorithm in trading off the execution time costs between the different parts of the given algorithm for the best in performance, with the latest in Kim Walisch's tuning of Xavier Gourdon's work being about five to ten times faster than LMO (plus includes multi-threading abilities).

Perl

#!/usr/bin/perl

use strict; # https://rosettacode.org/wiki/Legendre_prime_counting_function
use warnings;
no warnings qw(recursion);
use ntheory qw( nth_prime prime_count );

my (%cachephi, %cachepi);

sub phi
  {
  return $cachephi{"@_"} //= do {
    my ($x, $aa) = @_;
    $aa <= 0 ? $x : phi($x, $aa - 1) - phi(int $x / nth_prime($aa), $aa - 1) };
  }

sub pi
  {
  return $cachepi{$_[0]} //= do {
    my $n = shift;
    $n < 2 ? 0 : do{ my $aa = pi(int sqrt $n); phi($n, $aa) + $aa - 1 } };
  }

print "e             n   Legendre    ntheory\n",
      "-             -   --------    -------\n";
for (1 .. 9)
  {
  printf "%d  %12d %10d %10d\n", $_, 10**$_, pi(10**$_), prime_count(10**$_);
  }

Output:

e             n   Legendre    ntheory
-             -   --------    -------
1            10          4          4
2           100         25         25
3          1000        168        168
4         10000       1229       1229
5        100000       9592       9592
6       1000000      78498      78498
7      10000000     664579     664579
8     100000000    5761455    5761455
9    1000000000   50847534   50847534

Phix

with javascript_semantics
--
-- While a phix dictionary can handle keys of {x,a}, for this 
-- task performance was dreadful (7,612,479 entries, >3mins),
-- so instead memophix maps known x to an index to memophia 
-- which holds the full [1..a] for each x, dropping to a much
-- more respectable (albeit not super-fast) 14s
--
constant memophix = new_dict()
sequence memophia = {}  -- 1..a (max 3401) for each x

function phi(integer x, a)
    if a=0 then return x end if
    integer adx = getd(x,memophix), res
    if adx=NULL then
        memophia = append(memophia,repeat(-1,a))
        adx = length(memophia)
        setd(x,adx,memophix)
    else
        object ma = memophia[adx]
        integer l = length(ma)
        if a>l then
            memophia[adx] = 0   -- kill refcount
            memophia[adx] = ma & repeat(-1,a-l)
        else
            res = ma[a]
            if res>=0 then return res end if
        end if
        ma = 0                  -- kill refcount
    end if
    res = phi(x, a-1) - phi(floor(x/get_prime(a)), a-1)
    memophia[adx][a] = res
    return res
end function
 
function pi(integer n)
    if n<2 then return 0 end if
    integer a = pi(floor(sqrt(n)))
    return phi(n, a) + a - 1
end function
 
atom t0 = time()
for i=0 to iff(platform()=JS?8:9) do
    printf(1,"10^%d  %d\n",{i,pi(power(10,i))})
--  printf(1,"10^%d  %d\n",{i,length(get_primes_le(power(10,i)))})
end for
?elapsed(time()-t0)

The commented-out length(get_primes_le()) gives the same results, in about the same time, and in both cases re-running the main loop a second time finishes near-instantly.

Output:

10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534
"14.0s"

(It is about 4 times slower under pwa/p2js so output is limited to 10^8, unless you like staring at a blank screen for 52s)

Picat

Performance is surprisingly good, considering that the Picat's library SoE is very basic and that I recompute the base primes each time pi() is called. This version takes 23 seconds on my laptop, outperforming the CoffeeScript version which takes 30 seconds.

table(+, +, nt)
phi(X, 0, _) = X.
phi(X, A, Primes) = phi(X, A - 1, Primes) - phi(X // Primes[A], A - 1, Primes).

pi(N) = Count, N < 2 => Count = 0.
pi(N) = Count =>
    M = floor(sqrt(N)),
    A = pi(M),
    Count = phi(N, A, math.primes(M)) + A - 1.

main =>
    N = 1,
    foreach (K in 0..9)
        writef("10^%w    %w%n", K, pi(N)),
        N := N * 10.

Output:

10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

PicoLisp

Uses code from the Sieve of Eratosthenes to generate primes.

(setq Legendre-Max (** 10 9))

(load "plcommon/eratosthenes.l")  # see task "Sieve of Eratosthenes, 2x3x5x7 wheel version.

# Create an index tree of the first N primes up to √Legendre-Max
(setq Sieve (sieve (sqrt Legendre-Max)))
(balance 'Primes (mapcar 'cons (range 1 (length Sieve)) Sieve))
(de prime (N)  (cdr (lup Primes N)))

(de ϕ (X A)
   (cache '(NIL) (cons X A)
      (if (=0 A)
         X
         (- (ϕ X (dec A)) (ϕ (/ X (prime A)) (dec A))))))

(de π (N)
   (if (< N 2)
      0
      (let A (π (sqrt N))
         (+ (ϕ N A) A -1))))

(for N 10 
   (prinl "10\^" (dec N) "^I" (π (** 10 (dec N)))))

(bye)

Output:

10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534

Python

Using primesieve module (pip install primesieve).

from primesieve import primes
from math import isqrt
from functools import cache

p = primes(isqrt(1_000_000_000))

@cache
def phi(x, a):
    res = 0
    while True:
        if not a or not x:
            return x + res
    
        a -= 1
        res -= phi(x//p[a], a) # partial tail recursion

def legpi(n):
    if n < 2: return 0

    a = legpi(isqrt(n))
    return phi(n, a) + a - 1

for e in range(10):
    print(f'10^{e}', legpi(10**e))

Output:

10^0 0
10^1 4
10^2 25
10^3 168
10^4 1229
10^5 9592
10^6 78498
10^7 664579
10^8 5761455
10^9 50847534

Raku

Seems like an awful lot of pointless work. Using prime sieve anyway, why not just use it?

use Math::Primesieve;

my $sieve = Math::Primesieve.new;

say "10^$_\t" ~ $sieve.count: exp($_,10) for ^10;

say (now - INIT now) ~ ' elapsed seconds';

Output:

10^0	0
10^1	4
10^2	25
10^3	168
10^4	1229
10^5	9592
10^6	78498
10^7	664579
10^8	5761455
10^9	50847534
0.071464489 elapsed seconds

Sidef

func legendre_phi(x, a) is cached {
    return x if (a <= 0)
    __FUNC__(x, a-1) - __FUNC__(idiv(x, a.prime), a-1)
}

func legendre_prime_count(n) is cached {
    return 0 if (n < 2)
    var a = __FUNC__(n.isqrt)
    legendre_phi(n, a) + a - 1
}

print("e             n   Legendre    builtin\n",
      "-             -   --------    -------\n")

for n in (1 .. 9) {
    printf("%d  %12d %10d %10d\n", n, 10**n,
        legendre_prime_count(10**n), prime_count(10**n))
    assert_eq(legendre_prime_count(10**n), prime_count(10**n))
}

Output:

e             n   Legendre    builtin
-             -   --------    -------
1            10          4          4
2           100         25         25
3          1000        168        168
4         10000       1229       1229
5        100000       9592       9592
6       1000000      78498      78498
7      10000000     664579     664579

Alternative implementation of the Legendre phi function, by counting k-rough numbers <= n.

func rough_count (n,k) {

    # Count of k-rough numbers <= n.

    func (n,p) is cached {

        if (p > n.isqrt) {
            return 1
        }

        if (p == 2) {
            return (n >> 1)
        }

        if (p == 3) {
            var t = idiv(n,3)
            return (t - (t >> 1))
        }

        var u = 0
        var t = idiv(n,p)

        for (var q = 2; q < p; q.next_prime!) {

            var v = __FUNC__(t - (t % q), q)

            if (v == 1) {
                u += prime_count(q, p-1)
                break
            }

            u += v
        }

        t - u
    }(n*k, k)
}

func legendre_phi(n, a) {
     rough_count(n, prime(a+1))
}

func legendre_prime_count(n) is cached {
    return 0 if (n < 2)
    var a = __FUNC__(n.isqrt)
    legendre_phi(n, a) + a - 1
}

print("e             n   Legendre    builtin\n",
      "-             -   --------    -------\n")

for n in (1 .. 9) {
    printf("%d  %12d %10d %10d\n", n, 10**n,
        legendre_prime_count(10**n), prime_count(10**n))
    assert_eq(legendre_prime_count(10**n), prime_count(10**n))
}

Swift

import Foundation

extension Numeric where Self: Strideable {
  @inlinable
  public func power(_ n: Self) -> Self {
    return stride(from: 0, to: n, by: 1).lazy.map({_ in self }).reduce(1, *)
  }
}

func eratosthenes(limit: Int) -> [Int] {
  guard limit >= 3 else {
    return limit < 2 ? [] : [2]
  }

  let ndxLimit = (limit - 3) / 2 + 1
  let bufSize = ((limit - 3) / 2) / 32 + 1
  let sqrtNdxLimit = (Int(Double(limit).squareRoot()) - 3) / 2 + 1
  var cmpsts = Array(repeating: 0, count: bufSize)

  for ndx in 0..<sqrtNdxLimit where (cmpsts[ndx >> 5] & (1 << (ndx & 31))) == 0 {
    let p = ndx + ndx + 3
    var cullPos = (p * p - 3) / 2

    while cullPos < ndxLimit {
      cmpsts[cullPos >> 5] |= 1 << (cullPos & 31)

      cullPos += p
    }
  }

  return (-1..<ndxLimit).compactMap({i -> Int? in
    if i < 0 {
      return 2
    } else {
      if cmpsts[i >> 5] & (1 << (i & 31)) == 0 {
        return .some(i + i + 3)
      } else {
        return nil
      }
    }
  })
}

let primes = eratosthenes(limit: 1_000_000_000)

func φ(_ x: Int, _ a: Int) -> Int {
  struct Cache {
    static var cache = [String: Int]()
  }

  guard a != 0 else {
    return x
  }

  guard Cache.cache["\(x),\(a)"] == nil else {
    return Cache.cache["\(x),\(a)"]!
  }

  Cache.cache["\(x),\(a)"] = φ(x, a - 1) - φ(x / primes[a - 1], a - 1)

  return Cache.cache["\(x),\(a)"]!
}

func π(n: Int) -> Int {
  guard n > 2 else {
    return 0
  }

  let a = π(n: Int(Double(n).squareRoot()))

  return φ(n, a) + a - 1
}

for i in 0..<10 {
  let n = 10.power(i)

  print("π(10^\(i)) = \(π(n: n))")
}

Output:

π(10^0) = 0
π(10^1) = 4
π(10^2) = 25
π(10^3) = 168
π(10^4) = 1229
π(10^5) = 9592
π(10^6) = 78498
π(10^7) = 664579
π(10^8) = 5761455
π(10^9) = 50847534

Wren

Library: Wren-math

Memoized

This runs in about 4.5 seconds which is not too bad for the Wren interpreter. As map keys cannot be lists, the Cantor pairing function has been used to represent [x, a] which is considerably faster than using a string based approach for memoization.

To sieve a billion numbers and then count the primes up to 10^k would take about 53 seconds in Wren so, as expected, the Legendre method represents a considerable speed up.

import "./math" for Int

var pi = Fn.new { |n|
    if (n < 3) return (n < 2) ? 0 : 1
    var primes = Int.primeSieve(n.sqrt.floor)
    var a = primes.count
    var memoPhi = {}

    var phi // recursive closure
    phi = Fn.new { |x, a|
        if (a <= 1) return (a < 1) ? x : x - (x >> 1)
        var pa = primes[a-1]
        if (x <= pa) return 1
        var key = Int.cantorPair(x, a)
        if (memoPhi.containsKey(key)) return memoPhi[key]
        return memoPhi[key] = phi.call(x, a-1) - phi.call((x/pa).floor, a-1)
    }

    return phi.call(n, a) + a - 1
}

var n = 1
for (i in 0..9) {
    System.print("10^%(i)  %(pi.call(n))")
    n = n * 10
}

Output:

10^0  0
10^1  4
10^2  25
10^3  168
10^4  1229
10^5  9592
10^6  78498
10^7  664579
10^8  5761455
10^9  50847534

Non-memoized

Inspired by the non-memoized Nim version.

The memoized version requires a cache of around 6.5 million map entries, which at 8 bytes each (all numbers are 64 bit floats in Wren) for both the key and the value, equates in round figures to memory usage of 104 MB on top of that needed for the prime sieve. The following version strips out memoization and, whilst somewhat slower at 5.4 seconds, may be preferred in a constrained memory environment.

import "./math" for Int

var pi = Fn.new { |n|
    if (n < 3) return (n < 2) ? 0 : 1
    var primes = Int.primeSieve(n.sqrt.floor)
    var a = primes.count

    var phi // recursive closure
    phi = Fn.new { |x, a|
        if (a <= 1) return (a < 1) ? x : x - (x >> 1)
        var pa = primes[a-1]
        if (x <= pa) return 1
        return phi.call(x, a-1) - phi.call((x/pa).floor, a-1)
    }

    return phi.call(n, a) + a - 1
}

var n = 1
for (i in 0..9) {
    System.print("10^%(i)  %(pi.call(n))")
    n = n * 10
}

Output:

Same as first version.