@p typesetter = tex % Set your own page size here . . . . \hsize 160mm \vsize 245mm \def\fwseca#1#2{\fwlibc{#1}{#2}} \def\fwsecb#1#2{\fwlibd{#1}{#2}} \def\fwsecc#1#2{\fwlibe{#1}{#2}} \def\fwsecd#1#2{\fwlibe{#1}{#2}} \def\fwsece#1#2{\fwlibe{#1}{#2}} @t vskip 40 mm @t title titlefont centre "A Programming Pearl" @t title titlefont centre "Generating sorted random numbers" @t vskip 20 mm @t title smalltitlefont centre "Don Knuth" @t vskip 10 mm @t title normalfont centre "FunnelWeb by Richard Walker" @t new_page @t table_of_contents @t new_page @!------------------------------------------------------------ @a@ Jon Bentley recently discussed the following interesting problem as one of his Programming Pearls' [{\sl Communications of the ACM 27\/} (December, 1984), 1179--1182]: The input consists of two integers @{M@} and @{N@}, with @{M@}$<$@{N@}. The output is a sorted list of @{M@} random numbers in the range 1..@{N@} in which no integer occurs more than once. For probability buffs, we desire a sorted selection without replacement in which each selection occurs equiprobably. The present program illustrates what I think is the best solution to the problem, when @{M@} is reasonably large yet small compared to @{N@}. It's the method described tersely in the answer to exercise 3.4.2--15 of my book {\sl Seminumerical Algorithms}, pp.~141 and 555. @!------------------------------------------------------------ @a@ For simplicity, all input and output in this program is assumed to be handled at the terminal. The @{WEB@} macros @{read_terminal@}, @{print@}, and @{print_ln@} defined here can easily be changed to accommodate other conventions. Input a value from the terminal. @$@@(@1@)@M==@{read(@1)@} Output to the terminal. @$@@(@1@)@M==@{write(@1)@} Output to the terminal and end the line. @$@@(@1@)@M==@{writeln(@1)@} @!------------------------------------------------------------ @a@ Here's an outline of the entire Pascal program: @o@==@{@- program sample(input,output); var @ @ begin @; end. @} @!------------------------------------------------------------ @a@ The global variables @{M@} and @{N@} have already been mentioned; we had better declare them. Other global variables will be declared later. @{M_max@} is the maximum value of @{M@} allowed in this program. @$@@M==@{5000@} @{2M-1_max@} is $2M-1$ for use later on. @$@<2M-1_max@>@M==@{9999@} @$@+=@{@- M: integer; { size of the sample } N: integer; { size of the population } @} @!------------------------------------------------------------ @a@ We assume the existence of a system routine called @{rand_int(i,j)@} that returns a random integer chosen uniformly in the range $i..j$. @$@==@{@- function rand_int(i,j: integer): integer; external; @} @!------------------------------------------------------------ @a@ After the user has specified @{M@} and @{N@}, we compute the sample by following a general procedure recommended by Bentley: @$@==@{@- @; size := 0; @; while size < M do begin T := rand_int(1,N); @; end; @@} @!------------------------------------------------------------ @a@ The main program just sketched has introduced several more globals. There's a set @{S@} of integers, whose representation will be deferred until later; but we can declare two auxiliary integer variables now. @$@+=@{@- size: integer; { the number of elements in set S } T: integer; { new candidate for membership in S } @} @!------------------------------------------------------------ @a The first order of business is to have a short dialogue with the user. @$@==@{@- repeat @@('population size: N = '@); @@(N@); if N <= 0 then @@('N should be positive!'@); until N > 0; repeat @@('sample size: M = '@); @@(M@); if M < 0 then @@('M shouldn''t be negative!'@) else if M > N then @@('M shouldn''t exceed N!'@) else if M > @ then @@('(Sorry, M must be at most ',@:1,'.)'@); until (M >= 0) and (M <= N) and (M <= @)@} @!------------------------------------------------------------ @a@ The key idea to an efficient solution of this sampling problem is to maintain a set whose entries are easily sorted. The method of ordered hash tables' [Amble and Knuth, {\sl The Computer Journal 17\/} (May 1974), 135--142] is ideally suited to this task, as we shall see. Ordered hashing is similar to ordinary linear probing, except that the relative order of keys is taken into account. The cited paper derives theoretical results that will not be rederived here, but we shall use the following fundamental property: {\sl The entries of an ordered hash table are independent of the order in which its keys were inserted}. Thus, an ordered hash table is a `canonical' representation of its set of entries. We shall represent @{S@} by an array of $2M$ integers. Since Pascal doesn't permit arrays of variable size, we must leave room for the largest possible table. @$@+=@{@- hash: array[0..@<2M-1_max@>] of integer; { the ordered hash table } H: 0..@<2M-1_max@>; { an index into hash } H_max: 0..@<2M-1_max@>; { the current hash size } alpha: real; { the ratio of table size to N } @} @!------------------------------------------------------------ @a @$@==@{@- H_max := 2 * M - 1; alpha := 2 * M / N; for H := 0 to H_max do hash[H] := 0@} @!------------------------------------------------------------ @a Now we come to the interesting part, where the algorithm tries to insert @{T@} into an ordered hash table. The hash address $H=\lfloor2M(T-1)/N\rfloor$ is used as a starting point, since this quantity is monotonic in @{T@} and almost uniformly distributed in the range $0\le H<2M$. @$@==@{@- H := trunc(alpha * (T-1)); while hash[H] > T do if H = 0 then H := H_max else H := H-1; if hash[H] < T then { T is not present } begin size := size + 1; @; end@} @!------------------------------------------------------------ The heart of ordered hashing is the insertion process. In general, the new key @{T@} will be inserted in place of a previous key$T_1==@{@- while hash[H] > 0 do begin TT := hash[H]; { we have 0 < TT < T } hash[H] := T; T := TT; repeat if H = 0 then H := H_max else H := H - 1; until hash[H] < T; end; hash[H] := T@} @!------------------------------------------------------------ @a @$@+=@{@- TT: integer; { a key that's being moved } @} @!------------------------------------------------------------ @a@ The climax of this program is the fact that the entries in our ordered hash table can easily be read out in increasing order. Why is this true? Well, we know that the final state of the table is independent of the order in which the elements entered. Furthermore it's easy to understand what the table looks like when the entries are inserted in decreasing order, because we have used a monotonic hash function. Therefore we know that the table must have an especially simple form. Suppose the nonzero entries are$T_1<\cdots@M==@{@@(hash[H] : 10@)@} @\$@==@{@- if hash[0] = 0 then { there was no wrap-around } begin for H := 1 to H_max do if hash[H] > 0 then @; end else begin for H := 1 to H_max do { print the wrapped-around entries } if hash[H] > 0 then if hash[H] < hash[0] then @; for H := 0 to H_max do if hash[H] >= hash[0] then @; end@}