Stream merge
You are encouraged to solve this task according to the task description, using any language you may know.
- 2-stream merge
- Read two sorted streams of items from external source (e.g. disk, or network), and write one stream of sorted items to external sink.
- Common algorithm: keep one buffered item from each source, select the minimum of the two, write it, then fetch the next item from the stream the written item came from.
- N-stream merge
- The same as above, but reading from N sources.
- Common algorithm: same as above, but keep the buffered items, together with their source descriptors, in a heap.
Assume the streams are very large. You must not slurp them whole into memory; read them as streams.
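The two algorithms described above can be sketched in Python, using generators as the "streams". This is only an illustrative sketch: the names `merge2` and `merge_n` are hypothetical, and it assumes `None` never occurs as a stream item (it is used as the exhaustion sentinel).

```python
import heapq

def merge2(a, b):
    """2-stream merge: keep one buffered item per source, emit the
    smaller, then refill the buffer from the stream it came from."""
    head_a, head_b = next(a, None), next(b, None)
    while head_a is not None and head_b is not None:
        if head_a <= head_b:
            yield head_a
            head_a = next(a, None)
        else:
            yield head_b
            head_b = next(b, None)
    # one stream is exhausted; drain the other
    while head_a is not None:
        yield head_a
        head_a = next(a, None)
    while head_b is not None:
        yield head_b
        head_b = next(b, None)

def merge_n(*streams):
    """N-stream merge: keep (head item, source index) pairs in a heap,
    so the minimal buffered item is always at the top."""
    heap = []
    for i, s in enumerate(streams):
        head = next(s, None)
        if head is not None:
            heapq.heappush(heap, (head, i))
    while heap:
        head, i = heapq.heappop(heap)
        yield head
        refill = next(streams[i], None)  # fetch from the same source
        if refill is not None:
            heapq.heappush(heap, (refill, i))
```

For example, `list(merge2(iter([1, 3, 5]), iter([2, 4, 6])))` yields `[1, 2, 3, 4, 5, 6]`, and `merge_n` does the same for any number of sorted sources.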
Haskell
There is no built-in iterator or stream type for file operations in Haskell, but several libraries provide one.
conduit
<lang haskell>-- stack runhaskell --package=conduit-extra --package=conduit-merge
import Control.Monad.Trans.Resource (runResourceT)
import qualified Data.ByteString.Char8 as BS
import Data.Conduit (($$), (=$=))
import Data.Conduit.Binary (sinkHandle, sourceFile)
import qualified Data.Conduit.Binary as Conduit
import qualified Data.Conduit.List as Conduit
import Data.Conduit.Merge (mergeSources)
import System.Environment (getArgs)
import System.IO (stdout)

main :: IO ()
main = do
    inputFileNames <- getArgs
    let inputs = [sourceFile file =$= Conduit.lines | file <- inputFileNames]
    runResourceT $ mergeSources inputs $$ sinkStdoutLn
  where
    sinkStdoutLn = Conduit.map (`BS.snoc` '\n') =$= sinkHandle stdout</lang>
pipes
<lang haskell>-- stack runhaskell --package=pipes-safe --package=pipes-interleave
import Pipes (runEffect, (>->))
import Pipes.Interleave (interleave)
import Pipes.Prelude (stdoutLn)
import Pipes.Safe (runSafeT)
import Pipes.Safe.Prelude (readFile)
import Prelude hiding (readFile)
import System.Environment (getArgs)

main :: IO ()
main = do
    sourceFileNames <- getArgs
    let sources = map readFile sourceFileNames
    runSafeT . runEffect $ interleave compare sources >-> stdoutLn</lang>
Perl 6
<lang perl6>sub merge_streams ( @streams ) {
    my @s = @streams.map({ hash( STREAM => $_, HEAD => .get ) })\
                    .grep({ .<HEAD>.defined });

    return gather while @s {
        my $h = @s.min: *.<HEAD>;
        take $h<HEAD>;
        $h<HEAD> = $h<STREAM>.get
            orelse @s .= grep( { $_ !=== $h } );
    }
}

say merge_streams([ @*ARGS».&open ]);</lang>
Python
The built-in function open opens a file for reading and returns a line-by-line iterator (stream) over the file.
The standard library function heapq.merge takes any number of sorted stream iterators and merges them into one sorted iterator, using a heap.
<lang python>import heapq
import sys

sources = sys.argv[1:]
# heapq.merge takes the iterables as separate arguments, so unpack them
for item in heapq.merge(*(open(source) for source in sources)):
    print(item, end='')  # lines already carry their newline</lang>
REXX
<lang rexx>/**********************************************************************
* Merge 1.txt ... n.txt into m.txt
* 1.txt  2.txt  3.txt  4.txt
*     1     19   1999    2e3
*    17     33   2999   3000
*     8    500   3999
**********************************************************************/
n=4
high='ffff'x
p.=''
Do i=1 To n
  f.i=i'.txt'
  Call get i
  End
Do Forever
  min=high
  Do i=1 To n
    If x.i<<min Then Do        /* avoid numerical comparison */
      imin=i
      min=x.i
      End
    End
  If min<<high Then Do
    Call o x.imin
    Call get imin
    End
  Else Do
    Call lineout oid
    Leave
    End
  End
Exit
get: Procedure Expose f. x. high p.
  Parse Arg ii
  If lines(f.ii)=0 Then
    x.ii=high
  Else Do
    x.ii=linein(f.ii)
    If x.ii<<p.ii Then Do
      Say 'Input file' f.ii 'is not sorted ascendingly'
      Say p.ii 'precedes' x.ii
      Exit
      End
    p.ii=x.ii
    End
  Return
o: Say arg(1)
   Return lineout(oid,arg(1))</lang>
- Output:
1
17
19
1999
2999
2e3
3000
33
3999
500
8
Shell
<lang sh>sort --merge source1 source2 sourceN > sink</lang>
zkl
This solution uses iterators, doesn't care where the streams originate, and only keeps the head of each stream on hand.
<lang zkl>fcn mergeStreams(s1,s2,etc){  //-->Walker
   streams:=vm.arglist.pump(List(),fcn(s){ // prime and prune
      if( (w:=s.walker())._next() ) return(w);
      Void.Skip  // stream is dry
   });
   Walker().tweak(fcn(streams){
      if(not streams) return(Void.Stop);  // all streams are dry
      values:=streams.apply("value");     // heads of the streams
      v:=values.reduce('wrap(min,x){ if(min<=x) min else x });
      n:=values.find(v); w:=streams[n];
      w._next();                  // read next value from min stream
      if(w.atEnd) streams.del(n); // prune empty streams
      v
   }.fp(streams));
}</lang>
Using infinite streams:
<lang zkl>w:=mergeStreams([0..],[2..*,2],[3..*,3],T(5));
w.walk(20).println();</lang>
- Output:
L(0,1,2,2,3,3,4,4,5,5,6,6,6,7,8,8,9,9,10,10)
Using files:
<lang zkl>w:=mergeStreams(File("unixdict.txt"),File("2hkprimes.txt"),File("/dev/null"));
do(10){ w.read().print() }</lang>
- Output:
10th 1st 2 2nd 3 3rd 4th 5 5th 6th
Using the above example to squirt the merged stream to a file:
<lang zkl>mergeStreams(File("unixdict.txt"),File("2hkprimes.txt"),File("/dev/null"))
   .pump(File("foo.txt","w"));</lang>
- Output:
$ ls -l unixdict.txt 2hkprimes.txt foo.txt
-rw-r--r-- 1 craigd craigd 1510484 Oct 29  2013 2hkprimes.txt
-rw-r--r-- 1 craigd craigd 1716887 Jun 16 23:34 foo.txt
-rw-r--r-- 1 craigd craigd  206403 Jun 11  2014 unixdict.txt