Knuth's algorithm S: Difference between revisions
(→Tcl: Added implementation) |
|||
Line 107: | Line 107: | ||
The above can be instantiated as follows after which <code>s_of_n</code> can be called in the same way as it is in the first example where it is a function instead of an instance. |
The above can be instantiated as follows after which <code>s_of_n</code> can be called in the same way as it is in the first example where it is a function instead of an instance. |
||
<lang python>s_of_n = S_of_n_creator(3)</lang> |
<lang python>s_of_n = S_of_n_creator(3)</lang> |
||
=={{header|Tcl}}== |
|||
<lang tcl>package require Tcl 8.6 |
|||
oo::class create SofN { |
|||
variable items size count |
|||
constructor {n} { |
|||
set size $n |
|||
} |
|||
method item {item} { |
|||
if {[incr count] <= $size} { |
|||
lappend items $item |
|||
} elseif {rand()*$count < $size} { |
|||
lset items [expr {int($size * rand())}] $item |
|||
} |
|||
return $items |
|||
} |
|||
} |
|||
# Test code |
|||
for {set i 0} {$i < 100000} {incr i} { |
|||
set sOf3 [SofN new 3] |
|||
foreach digit {0 1 2 3 4 5 6 7 8 9} { |
|||
set digs [$sOf3 item $digit] |
|||
} |
|||
$sOf3 destroy |
|||
foreach digit $digs { |
|||
incr freq($digit) |
|||
} |
|||
} |
|||
parray freq</lang> |
|||
Sample output:<pre> |
|||
freq(0) = 29812 |
|||
freq(1) = 30099 |
|||
freq(2) = 29927 |
|||
freq(3) = 30106 |
|||
freq(4) = 30048 |
|||
freq(5) = 29993 |
|||
freq(6) = 29912 |
|||
freq(7) = 30219 |
|||
freq(8) = 30060 |
|||
freq(9) = 29824 |
|||
</pre> |
Revision as of 05:55, 22 October 2011
This is a method of randomly sampling n items from a set of M items, with equal probability; where M >= n and M, the number of items is unknown until the end. This means that the equal probability sampling should be maintained for all successive items > n as they become available (although the content of successive samples can change).
- The algorithm
- Select the first n items as the sample as they become available;
- For the i-th item where i > n, have a random chance of n/i of keeping it. If failing this chance, the sample remains the same. If not, have it randomly (1/n) replace one of the previously selected n items of the sample.
- Repeat #2 for any subsequent items.
- The Task
- Create a function
s_of_n_creator
that given the maximum sample size, returns a functions_of_n
that takes one parameter,item
. - Function
s_of_n
when called with successive items returns an equi-weighted random sample of up to n of its items so far, each time it is called, calculated using Knuths Algorithm S. - Test your functions by printing and showing the frequency of occurrences of the selected digits from 100,000 repetitions of:
- Use the s_of_n_creator with n == 3 to generate an s_of_n.
- call s_of_n with each of the digits 0 to 9 in order, keeping the returned three digits of its random sampling from its last call with argument item=9.
Note: A class taking n and generating a callable instance/function might also be used.
- Reference
- The Art of Computer Programming, Vol 2, 3.4.2 p.142
- Cf.
Python
<lang python>from random import random, randrange
def s_of_n_creator(n):
sample, i = [], 0 def s_of_n(item): nonlocal i
i += 1 if i <= n: # Keep first n items sample.append(item) elif random() < n / i: # Keep item del sample[randrange(n)] sample.append(item) return sample return s_of_n
if __name__ == '__main__':
bin = [0]* 10 items = range(10) print("Single run samples for n = 3:") s_of_n = s_of_n_creator(3) for item in items: sample = s_of_n(item) print(" Item: %i -> sample: %s" % (item, sample)) # for trial in range(100000): s_of_n = s_of_n_creator(3) for item in items: sample = s_of_n(item) for s in sample: bin[s] += 1 print("\nTest item frequencies for 100000 runs:\n ", '\n '.join("%i:%i" % x for x in enumerate(bin)))</lang>
- Sample output
Single run samples for n = 3: Item: 0 -> sample: [0] Item: 1 -> sample: [0, 1] Item: 2 -> sample: [0, 1, 2] Item: 3 -> sample: [0, 1, 3] Item: 4 -> sample: [0, 1, 3] Item: 5 -> sample: [0, 1, 3] Item: 6 -> sample: [0, 1, 3] Item: 7 -> sample: [0, 3, 7] Item: 8 -> sample: [0, 3, 7] Item: 9 -> sample: [0, 3, 7] Test item frequencies for 100000 runs: 0:29983 1:30240 2:29779 3:29921 4:30224 5:29967 6:30036 7:30050 8:29758 9:30042
Python Class based version
Only a slight change creates the following class-based implementation: <lang python>class S_of_n_creator():
def __init__(self, n): self.n = n self.i = 0 self.sample = [] def __call__(self, item): self.i += 1 n, i, sample = self.n, self.i, self.sample if i <= n: # Keep first n items sample.append(item) elif random() < n / i: # Keep item del sample[randrange(n)] sample.append(item) return sample</lang>
The above can be instantiated as follows after which s_of_n
can be called in the same way as it is in the first example where it is a function instead of an instance.
<lang python>s_of_n = S_of_n_creator(3)</lang>
Tcl
<lang tcl>package require Tcl 8.6
oo::class create SofN {
variable items size count constructor {n} {
set size $n
} method item {item} {
if {[incr count] <= $size} { lappend items $item } elseif {rand()*$count < $size} { lset items [expr {int($size * rand())}] $item } return $items
}
}
- Test code
for {set i 0} {$i < 100000} {incr i} {
set sOf3 [SofN new 3] foreach digit {0 1 2 3 4 5 6 7 8 9} {
set digs [$sOf3 item $digit]
} $sOf3 destroy foreach digit $digs {
incr freq($digit)
}
} parray freq</lang>
Sample output:
freq(0) = 29812 freq(1) = 30099 freq(2) = 29927 freq(3) = 30106 freq(4) = 30048 freq(5) = 29993 freq(6) = 29912 freq(7) = 30219 freq(8) = 30060 freq(9) = 29824