Run-length encoding: Difference between revisions

Content added Content deleted

Inline

Revision as of 23:21, 23 October 2009

This page uses content from Wikipedia. The original article was at Run-length_encoding. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance)

Given a string containing uppercase characters (A-Z), compress repeated 'runs' of the same character by storing the length of that run, and provide a function to reverse the compression. The output can be anything, as long as you can recreate the input with it.

Example:

Input: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Output: 12W1B12W3B24W1B14W

Ada

<lang Ada> with Ada.Text_IO; use Ada.Text_IO; with Ada.Strings.Fixed; use Ada.Strings.Fixed; procedure Test_Run_Length_Encoding is

  function Encode (Data : String) return String is
  begin
     if Data'Length = 0 then
        return "";
     else
        declare
           Code  : constant Character := Data (Data'First);
           Index : Integer := Data'First + 1;
        begin
           while Index <= Data'Last and then Code = Data (Index) loop
              Index := Index + 1;
           end loop;
           declare
              Prefix : constant String := Integer'Image (Index - Data'First);
           begin
              return Prefix (2..Prefix'Last) & Code & Encode (Data (Index..Data'Last));
           end;
        end;
     end if;
  end Encode;
  function Decode (Data : String) return String is
  begin
     if Data'Length = 0 then
        return "";
     else
        declare
           Index : Integer := Data'First;
           Count : Natural := 0;
        begin
           while Index < Data'Last and then Data (Index) in '0'..'9' loop
              Count := Count * 10 + Character'Pos (Data (Index)) - Character'Pos ('0');
              Index := Index + 1;
           end loop;
           if Index > Data'First then
              return Count * Data (Index) & Decode (Data (Index + 1..Data'Last));
           else
              return Data;
           end if;
        end;
     end if;
  end Decode;

begin

  Put_Line (Encode ("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"));
  Put_Line (Decode ("12W1B12W3B24W1B14W"));

end Test_Run_Length_Encoding; </lang> Sample output:

12W1B12W3B24W1B14W
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

AWK

Works with: gawk

It works with "textual" input. Lines containing numbers are skipped, since they can't be represented in a not ambiguous way in this implementation (e.g. "11AA" would be encoded as "212A", which would be decoded as A repeated 212 times!)

Encoding

<lang awk>BEGIN {

FS=""

} /^[^0-9]+$/ {

 cp = $1; j = 0
 for(i=1; i <= NF; i++) {
   if ( $i == cp ) {
     j++; 
   } else {
     printf("%d%c", j, cp)
     j = 1
   }
   cp = $i
 }
 printf("%d%c", j, cp)

}</lang>

Decoding

<lang awk>BEGIN {

 RS="[0-9]+[^0-9]"
 final = "";

} {

 match(RT, /([0-9]+)([^0-9])/, r)
 for(i=0; i < int(r[1]); i++) {
   final = final r[2]
 }

} END {

 print final

}</lang>

ALGOL 68

Works with: ALGOL 68 version Standard - no extensions to language used

Works with: ALGOL 68G version Any - tested with release mk15-0.8b.fc9.i386

Note: The following uses iterators, eliminating the need of declaring arbitrarily large CHAR arrays for caching. <lang algol>STRING input := "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"; STRING output := "12W1B12W3B24W1B14W";

MODE ITERATOR = VOID; MODE YIELDCHAR = PROC(CHAR)ITERATOR; MODE ITERCHAR = PROC(YIELDCHAR)ITERATOR;

PROC char seq = (REF STRING s, YIELDCHAR yield)ITERATOR:

 FOR i FROM LWB s TO UPB s DO
   yield(s[i])
 OD;

Note: The following 2 lines use currying. This not supported by ELLA ALGOL 68RS #

ITERCHAR input seq = char seq(input,),

        output seq = char seq(output,);

PROC for encode = (ITERCHAR for seq, YIELDCHAR yield)ITERATOR: (

 INT count := 0;
 CHAR prev;

FOR c IN seq DO #

 for seq((CHAR c)VOID: (
     IF count = 0 THEN
       count := 1;
       prev := c
     ELIF c /= prev THEN
       STRING str count := whole(count,0);
       char seq(str count, yield); count := 1;
       yield(prev); prev := c
     ELSE
       count +:=1
     FI
   )
 )

OD #;

 IF count /= 0 THEN
   STRING str count := whole(count,0);
   char seq(str count,yield);
   yield(prev)
 FI

);

STRING zero2nine = "0123456789";

PROC for decode = (ITERCHAR for seq, YIELDCHAR yield)ITERATOR: (

 INT repeat := 0;

FOR c IN seq DO #

 for seq((CHAR c)VOID: (
   IF char in string(c, NIL, zero2nine) THEN
     repeat := repeat*10 + ABS c - ABS "0"
   ELSE
     FOR i TO repeat DO
       yield(c)
     OD;
     repeat := 0
   FI
 ))

);

iterate through input string #

print("Encode input: ");

FOR c IN encode(input seq) DO #

 for encode(input seq, (CHAR c)VOID:
   print(c)
 )

OD #;

print(new line);

print("Decode output: ");

FOR c IN decode(output seq) DO #

 for decode(output seq, (CHAR c)VOID:
   print(c)
 )

OD #;

print(new line)</lang>

Encode input: 12W1B12W3B24W1B14W
Decode output: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

AutoHotkey

<lang AutoHotkey> MsgBox % key := rle_encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") MsgBox % rle_decode(key)

rle_encode(message) {

 StringLeft, previous, message, 1
 StringRight, last, message, 1
 message .= Asc(Chr(last)+1)
 count = 0
 Loop, Parse, message
 {
   If (previous == A_LoopField)
     count +=1
   Else
   {
     output .= previous . count
     previous := A_LoopField 
     count = 1
   }
 }
 Return output

}

rle_decode(message) {

 pos = 1
 While, item := RegExMatch(message, "\D", char, pos)
 {
   digpos := RegExMatch(message, "\d+", dig, item)
   Loop, % dig
     output .= char
   pos := digpos 
 }
 Return output

} </lang>

C

These functions have no check for the size of the output buffers.

Encoding function

Since repeat counter must fit a single byte in this implementation, it can't be greater than 255, so a byte repeated more than 255 times generates in the compressed stream more than 2 bytes (4 bytes if the length of the repeated byte sequence is less than 511 and so on)

<lang c>int rle_encode(char *out, const char *in, int l) {

 int dl, i;
 char cp, c;

 for(cp=c= *in++, i = 0, dl=0; l>0 ; c = *in++, l-- ) {
   if ( c == cp ) {
     i++;
     if ( i > 255 ) {

*out++ = 255; *out++ = c; dl += 2; i = 1;

     }
   } else {
     *out++ = i;
     *out++ = cp; dl += 2;
     i = 1;
   }
   cp = c;
 }
 *out++ = i; *out++ = cp; dl += 2;
 return dl;

}</lang>

Decoding function

<lang c>int rle_decode(char *out, const char *in, int l) {

 int i, j, tb;
 char c;

 for(tb=0 ; l>=0 ; l -= 2 ) {
   i = *in++;
   c = *in++;
   tb += i;
   while(i-- > 0) *out++ = c;
 }
 return tb;

}</lang>

Usage example

<lang c>#include <stdio.h>

include <stdlib.h>
include <string.h>

const char *o = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";

int main() {

 char *d = malloc(2*strlen(o));
 char *oc = malloc(strlen(o));
 
 int rl = rle_encode(d, o, strlen(o));
 /* fwrite(d, 1, rl, stdout); */

 int ocl = rle_decode(oc, d, rl);
 fwrite(oc, 1, ocl, stdout);

 free(d); free(oc);
 return 0;

}</lang>

In the following codes, encoding and decoding are implementend as "filters" which compress/decompress standard input to standard output writing ASCII strings; they will work as long as the input has no ASCII digits in it, and the compressed/original ratio of a "single group" will be less than or equal to 1 as long as the ASCII decimal representation's length of the repeat counter will be shorter than the length of the "group". It should be so except in the case the group is a single isolated character, e.g. B gives 1B (one byte against two compressed bytes)

Encoding filter

<lang c>#include <stdio.h>

int main() {

 int i, c, cp;

 for(cp=c=getchar(), i = 0; c != EOF; c = getchar() ) {
   if ( c == cp ) {
     i++;
   } else {
     printf("%d%c", i, cp);
     i = 1;
   }
   cp = c;
 }
 printf("%d%c", i, cp);

}</lang>

Decoding filter

<lang c>#include <stdio.h>

int main() {

 int i, c, j;

 while( scanf("%d%c", &i, &c) == 2 ) {
   for(j=0; j < i; j++) printf("%c", c);
 }

}</lang>

Final note: since the repeat counter value 0 has no meaning, it could be used as it would be 256, so extending by one the maximum number of repetitions representable with a single byte; or instead it could be used as a special marker to encode in a more efficient way (long) sequences of isolated characters, e.g. "ABCDE" would be encoded as "1A1B1C1D1E"; it could be instead encoded as "05ABCDE".

Clojure

<lang clojure> (defn step [[result, n, c], new] ;function used in encode's reduce call

 (cond
   (zero? n) [result, 1, new]
   (= c new) [result, (inc n), c]
   :else     [(conj result [n c]), 1, new]))

(defn encode [s]

 (let [[result,n,chr] (reduce step [[],0,nil] s)]
   (if (= n 0)
     result
     (conj result [n chr]))))

(defn decode [v]

 (let [expand (fn n c (repeat n c))]
   (apply str (mapcat expand v))))

</lang> <lang clojure> (def uncoded "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") (def coded [[12 \W] [1 \B] [12 \W] [3 \B] [24 \W] [1 \B] [14 \W]])

(assert (= (encode uncoded) coded)) (assert (= (decode coded) uncoded)) </lang>

Common Lisp

<lang lisp>(defun group-similar (sequence &key (test 'eql))

 (loop for x in (rest sequence)
       with temp = (subseq sequence 0 1)
       if (funcall test (first temp) x)
         do (push x temp)
       else
         collect temp
         and do (setf temp (list x))))

(defun run-length-encode (sequence)

 (mapcar (lambda (group) (list (first group) (length group)))
         (group-similar (coerce sequence 'list))))

(defun run-length-decode (sequence)

 (reduce (lambda (s1 s2) (concatenate 'simple-string s1 s2))
         (mapcar (lambda (elem)
                   (make-string (second elem)
                                :initial-element
                                (first elem)))
                 sequence)))

(run-length-encode "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") (run-length-decode '((#\W 12) (#\B 1) (#\W 12) (#\B 3) (#\W 24) (#\B 1)))</lang>

D

<lang d> import std.stdio; import std.string; void main() {

       char[]rle = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";
       char[]encoded = encode(rle);
       char[]decoded = decode(encoded);
       writefln("\"%s\" == \"%s\", intermediary %s",rle,decoded,encoded);
       assert(rle == decoded);

}

// this is essentially an exact copy of the look and say D function char[]encode(char[]input) {

       char last = input[$-1];
       char[]output;
       int count = 0;
       foreach_reverse(i;input) {
               if (i == last) {
                       count++;
               } else {
                       output = toString(count)~last~output;
                       count = 1;
                       last = i;
               }
       }
       output = toString(count)~last~output;
       return output;

}

char[]decode(char[]input) {

       char[]i = "";
       char[]ret;
       foreach(letter;input) {
               if (letter <= 'Z' && letter >= 'A') {
                       // this is the letter to be repeated
                       if (!i.length) throw new Exception("Can not repeat a letter without a number of repetitions");
                       ret ~= repeat([letter],atoi(i));
                       i = null;
               } else if (letter <= '9' && letter >= '0') {
                       // this is a digit to mark the number of repetitions
                       i ~= letter;
               } else {
                       throw new Exception("'"~letter~"' is not capalphanumeric");
               }
       }
       return ret;

} </lang>

E

<lang e>def rle(string) {

 var seen := null
 var count := 0
 var result := []
 def put() {
   if (seen != null) {
     result with= [count, seen]
   }
 }
 for ch in string {
   if (ch != seen) {
     put()
     seen := ch
     count := 0
   }
   count += 1
 }
 put()
 return result

}

def unrle(coded) {

 var result := ""
 for [count, ch] in coded {
   result += E.toString(ch) * count
 }
 return result

}</lang>

<lang e>? rle("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW")

value: [[12, 'W'], [1, 'B'], [12, 'W'], [3, 'B'], [24, 'W'], [1, 'B'], [14, 'W']]

? unrle(rle("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"))

value: "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"</lang>

FALSE

<lang false> 1^[^$~][$@$@=$[%%\1+\$0~]?~[@.,1\$]?%]#%\., {encode} </lang> <lang false> [0[^$$'9>'0@>|~]['0-\10*+]#]n: [n;!$~][[\$][1-\$,]#%%]#%% {decode} </lang>

Fan

- Generates a run-length encoding for a string

class RLE {

 Run[] encode(Str s)
 {
   runs := Run[,]

   s.size.times |i|
   {
     ch := s[i]
     if (runs.size==0 || runs.last.char != ch)
       runs.add(Run(ch))
     runs.last.inc
   }
   return runs
 }

 Str decode(Run[] runs)
 {
   buf := StrBuf()
   runs.each |run|
   {
     run.count.times { buf.add(run.char.toChar) }
   }
   return buf.toStr
 }

 Void main()
 {
   echo(decode(encode(

"WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"

       )))
 }

}

internal class Run {

 Int char
 Int count := 0
 new make(Int ch) { char = ch }
 Void inc() { ++count }

 override Str toStr() { return "${count}${char.toChar}" }

} </lang>

Haskell

<lang haskell>import Data.List (group)

-- Datatypes type Encoded = [(Int, Char)] -- An encoded String with form [(times, char), ...] type Decoded = String

-- Takes a decoded string and returns an encoded list of tuples rlencode :: Decoded -> Encoded rlencode = map (\g -> (length g, head g)) . group

-- Takes an encoded list of tuples and returns the associated decoded String rldecode :: Encoded -> Decoded rldecode = concatMap decodeTuple

   where decodeTuple (n,c) = replicate n c

main :: IO () main = do

 -- Get input
 putStr "String to encode: "
 input <- getLine
 -- Output encoded and decoded versions of input
 let encoded = rlencode input
     decoded = rldecode encoded
 putStrLn $ "Encoded: " ++ show encoded ++ "\nDecoded: " ++ show decoded</lang>

J

  NB. Run-length encode and decode...
  
  rle=: ([: ; (":@# , {.)&.>)@((1 , }. ~: }:) <;.1 ])
  rle 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'

12W1B12W3B24W1B14W

  rld=. '0123456789'&((i. ".@:{ ' ' ,~ [) # -.@:e.~ # ])
  rld'12W1B12W3B24W1B14W'

WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW </lang>

Java

<lang java>import java.util.regex.Matcher; import java.util.regex.Pattern; public class RunLengthEncoding {

   public static String encode(String source) {
       StringBuffer dest = new StringBuffer();
       for (int i = 0; i < source.length(); i++) {
           int runLength = 1;
           while (i+1 < source.length() && source.charAt(i) == source.charAt(i+1)) {
               runLength++;
               i++;
           }
           dest.append(runLength);
           dest.append(source.charAt(i));
       }
       return dest.toString();
   }

   public static String decode(String source) {
       StringBuffer dest = new StringBuffer();
       Pattern pattern = Pattern.compile("[0-9]+|[a-zA-Z]");
       Matcher matcher = pattern.matcher(source);
       while (matcher.find()) {
           int number = Integer.parseInt(matcher.group());
           matcher.find();
           while (number-- != 0) {
               dest.append(matcher.group());
           }
       }
       return dest.toString();
   }

   public static void main(String[] args) {
       String example = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";
       System.out.println(encode(example));
       System.out.println(decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B"));
   }

}</lang> Tests:

Library: JUnit

<lang java>import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class RunLengthEncodingTest { private RLE = new RunLengthEncoding();

@Test public void encodingTest() { assertEquals("1W", RLE.encode("W")); assertEquals("4W", RLE.encode("WWWW")); assertEquals("5w4i7k3i6p5e4d2i1a", RLE.encode("wwwwwiiiikkkkkkkiiippppppeeeeeddddiia")); assertEquals("12B1N12B3N24B1N14B", RLE.encode("BBBBBBBBBBBBNBBBBBBBBBBBBNNNBBBBBBBBBBBBBBBBBBBBBBBBNBBBBBBBBBBBBBB")); assertEquals("12W1B12W3B24W1B14W", RLE.encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW")); assertEquals("1W1B1W1B1W1B1W1B1W1B1W1B1W1B", RLE.encode("WBWBWBWBWBWBWB"));

}

@Test public void decodingTest() { assertEquals("W", RLE.decode("1W")); assertEquals("WWWW", RLE.decode("4W")); assertEquals("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW", RLE.decode("12W1B12W3B24W1B14W")); assertEquals("WBWBWBWBWBWBWB", RLE.decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B")); assertEquals("WBWBWBWBWBWBWB", RLE.decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B"));

} }</lang>

JavaScript

Here's an encoding method that walks the input string character by character <lang javascript>function encode(input) {

   var encoding = [];
   var prev, count, i;
   for (count = 1, prev = input[0], i = 1; i < input.length; i++) {
       if (input[i] != prev) {
           encoding.push([count, prev]);
           count = 1;
           prev = input[i];
       }
       else 
           count ++;
   }
   encoding.push([count, prev]);
   return encoding;

}</lang>

Here's an encoding method that uses a regular expression to grab the character runs (

Works with: JavaScript version 1.6

for the forEach method)

<lang javascript>function encode_re(input) {

   var encoding = [];
   input.match(/(.)\1*/g).forEach(function(substr){ encoding.push([substr.length, substr[0]]) });
   return encoding;

}</lang>

And to decode (see Repeating a string) <lang javascript>function decode(encoded) {

   var output = "";
   encoded.forEach(function(pair){ output += new Array(1+pair[0]).join(pair[1]) })
   return output;

}</lang>

Logo

<lang logo> to encode :str [:out "||] [:count 0] [:last first :str]

 if empty? :str [output (word :out :count :last)]
 if equal? first :str :last [output (encode bf :str :out :count+1 :last)]
 output (encode bf :str (word :out :count :last) 1 first :str)

end

to reps :n :w

 output ifelse :n = 0 ["||] [word :w reps :n-1 :w]

end to decode :str [:out "||] [:count 0]

 if empty? :str [output :out]
 if number? first :str [output (decode bf :str :out 10*:count + first :str)]
 output (decode bf :str word :out reps :count first :str)

end

make "foo "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW make "rle encode :foo show equal? :foo decode :rle </lang>

Mathematica

Custom functions using Map, Apply, pure functions, replacing using pattern matching, delayed rules and other functions: <lang Mathematica>

RunLengthEncode[input_String]:=StringJoin@@Sequence@@@({ToString @Length[#],First[#]}&/@Split[Characters[input]])
RunLengthDecode[input_String]:=StringJoin@@ConstantArray@@@Reverse/@Partition[(Characters[input]/.(ToString[#]->#&/@Range[0,9]))//.{x___,i_Integer,j_Integer,y___}:>{x,10i+j,y},2]

</lang> Example: <lang Mathematica>

mystring="WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";
RunLengthEncode[mystring]
RunLengthDecode[%]
%==mystring

</lang> gives back: <lang Mathematica>

12W1B12W3B24W1B14W
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
True

</lang>

Objective-C

Works with: GNUstep

The class RCRunLengthEncoder represents internally data with which it was feeded as pair character - repetition counter: it does not implement a binary representation of itself; it is left to another class, so that different input/output encodings are possible starting from the same class.

<lang objc>#import <Foundation/Foundation.h>

@interface RCRunLengthEncoder : NSObject {

 NSMutableArray *contents;
 NSMutableArray *counters;

} + (id)alloc; + (id)encoderWithData: (NSData *)data; - (id)init; - (id)initWithData: (NSData *)data; - (void)dealloc; - (void)addByte: (char)aByte; - (void)addByte: (char)aByte repeated: (int)repetitionCount; - (void)addData: (NSData *)data; - (NSData *)data; - (NSArray *)counters; - (NSArray *)contents; @end

@implementation RCRunLengthEncoder + (id)encoderWithData: (NSData *)data {

 return [[[self alloc] initWithData: data] autorelease];

}

- (id)initWithData: (NSData *)data {

 if ((self = [self init]) != nil) {
   [self addData: data];
 }
 return self;

}

- (id)init {

 if ((self = [super init]) != nil) {
   contents = [[NSMutableArray alloc] init];
   counters = [[NSMutableArray alloc] init];
 }
 return self;

}

- (void)dealloc {

 [counters release];
 [contents release];
 [super dealloc];

}

- (void)addByte: (char)aByte {

 [self addByte: aByte repeated: 1];

}

- (void)addByte: (char)aByte repeated: (int)repetitionCount {

 if ( ([contents count] == 0) || ([[contents lastObject] charValue] != aByte) ) {
   [contents addObject: [NSNumber numberWithChar: aByte]];
   [counters addObject: [NSNumber numberWithInt: repetitionCount]];
 } else {
   NSNumber *a = [counters lastObject];
   [counters removeLastObject];
   [counters addObject: [NSNumber numberWithInt:

([a intValue] + repetitionCount) ]];

}

- (void)addData: (NSData *)data {

 const char *d = [data bytes];
 NSUInteger i;
 for(i=0; i < [data length]; i++) [self addByte: d[i]];

}

- (NSArray *)contents {

 return contents;

}

- (NSArray *)counters {

 return counters;

}

- (NSData *)data {

 NSMutableData *d = [[NSMutableData alloc] initWithCapacity: 256];
 char buf[256];
 int i;

 for(i=0; i < [contents count]; i++) {
   char c = [[contents objectAtIndex: i] charValue];
   int n = [[counters objectAtIndex: i] intValue];
   memset(buf, c, 256);
   while ( n > 0 ) {
     [d appendBytes: buf length: MIN(256, n)];
     n -= 256;
   }
 }
 return [d autorelease];

} @end</lang>

The class codecRLE is derived from the previous, adding the methods that allow to binary encode the data internally held, and to create a internal representation from the encoded data. The specification here used are:

byte N >= 128, then the next byte must be repeated N-128 times (if N-128 is 0, the the byte must be repeated 128 times)
byte N < 128, then the next N bytes must be taken as they are (if N is 0, the next 128 bytes are literal)

<lang objc>@interface codecRLE : RCRunLengthEncoder - (NSData *)encode; - (void)decode: (NSData *)enc; @end

@implementation codecRLE - (void)decode: (NSData *)enc {

 const char *buf = [enc bytes];
 int i;

 for(i = 0; i < [enc length]; ) {
   if ( (buf[i] & 0x80) != 0) {
     [self addByte: buf[i+1] repeated: ( ((buf[i]&0x7f) == 0) ? 128 : (buf[i]&0x7f) ) ];
     i += 2;
   } else {
     int rc = (buf[i] == 0) ? 128 : buf[i];
     [self addData: [NSData dataWithBytes: &buf[i+1] length: rc]];
     i += rc + 1;
   }
 }

}

- (NSData *)encode {

 int literalCount=0;
 int i;
 char buf[129];

 NSMutableData *r = [[NSMutableData alloc] initWithCapacity: 256];

 for(i=0; i < [counters count]; i++) {
   char c = [[contents objectAtIndex: i] charValue];
   int howMany = [[counters objectAtIndex: i] intValue];
   if ( literalCount == 128 ) {
     buf[0] = 0;
     [r appendBytes: buf length: 129];
     literalCount = 0;
   }
   if ( howMany == 1 ) {
     buf[literalCount+1] = c;
     literalCount++;
   } else {
     if ( literalCount > 0 ) {

buf[0] = literalCount & 0x7f; [r appendBytes: buf length: (literalCount+1) ]; literalCount = 0;

     }
     buf[1] = c;
     while( howMany > 127 ) {

buf[0] = 0x80; [r appendBytes: buf length: 2]; howMany -= 128;

     }
     if (howMany > 0) {

buf[0] = howMany | 0x80; [r appendBytes: buf length: 2];

     }
   }
 }
 return [r autorelease];

} @end</lang>

Usage example:

<lang objc>const char *s = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";

int main() {

 NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

 codecRLE *enc = [[codecRLE alloc]

initWithData: [NSData dataWithBytes: s length: strlen(s)] ];

 NSData *repr = [enc encode];
 fwrite([repr bytes], 1, [repr length], stdout);
 [enc release];

 enc = [[codecRLE alloc] init];
 [enc decode: repr];
 NSData *d = [enc data];
 fwrite([d bytes], 1, [d length], stdout);
 [enc release];

 [pool release];
 return EXIT_SUCCESS;

}</lang>

Notes

The code is not deeply tested yet

OCaml

<lang ocaml>let encode str =

 let len = String.length str in
 let rec aux i acc =
   if i >= len then List.rev acc
   else
     let c1 = str.[i] in
     let rec aux2 j =
       if j >= len then (c1, j-i)
       else
         let c2 = str.[j] in
         if c1 = c2
         then aux2 (j+1)
         else (c1, j-i)
     in
     let (c,n) as t = aux2 (i+1) in
     aux (i+n) (t::acc)
 in
 aux 0 []

let decode lst =

 let l = List.map (fun (c,n) -> String.make n c) lst in
 (String.concat "" l)</lang>

<lang ocaml>let () =

 let e = encode "aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa" in
 List.iter (fun (c,n) ->
   Printf.printf " (%c, %d);\n" c n;
 ) e;
 print_endline (decode [('a', 5); ('h', 6); ('m', 7); ('u', 1); ('i', 7); ('a', 6)]);

</lang>

Perl

<lang perl>sub encode

{my $str = shift;
 $str =~ s {(.)(\1*)} {(length($2) + 1) . $1 . ';'}gse;
 return $str;}

sub decode

{my $str = shift;
 $str =~ s {(\d+)(.);} {$2 x $1}gse;
 return $str;}</lang>

The following modified versions of the previous one, encode/decode a bytes sequence in a way compatible with the functions of the C version.

<lang perl>sub encode

{my $str = shift;
 $str =~ s {(.)(\1{0,254})} {pack("C",(length($2) + 1)) . $1 }gse;
 return $str;}

sub decode {

    my @str = split //, shift;
    my $r = "";
    foreach my $i (0 .. scalar(@str)/2-1) {

$r .= $str[2*$i + 1] x unpack("C", $str[2*$i]);

    }
    return $r;

}</lang>

PowerBASIC

In this example, a character that only appears once isn't given a number in the encoded sequence. This prevents the code from getting longer. (A string without any runs is returned unchanged.) This version can also handle any arbitrary string that doesn't contain numbers (not just letters). (A flag value could be added which would allow the inclusion of any character, but such a flag isn't in this example.)

<lang powerbasic>FUNCTION RLDecode (i AS STRING) AS STRING

   DIM Loop0 AS LONG, count AS STRING, outP AS STRING, m AS STRING

   FOR Loop0 = 1 TO LEN(i)
       m = MID$(i, Loop0, 1)
       SELECT CASE m
           CASE "0" TO "9"
               count = count & m
           CASE ELSE
               IF LEN(count) THEN
                   outP = outP & STRING$(VAL(count), m)
                   count=""
               ELSE
                   outP = outP & m
               END IF
       END SELECT
   NEXT
   FUNCTION = outP

END FUNCTION

FUNCTION RLEncode (i AS STRING) AS STRING

   DIM tmp1 AS STRING, tmp2 AS STRING, outP AS STRING
   DIM Loop0 AS LONG, count AS LONG

   FOR Loop0 = 1 TO LEN(i)
       tmp1 = MID$(i, Loop0, 1)
       IF tmp1 <> tmp2 THEN
           IF count > 1 THEN
               outP = outP & TRIM$(STR$(count)) & tmp2
               tmp2 = tmp1
               count = 1
           ELSEIF 0 = count THEN
               tmp2 = tmp1
               count = 1
           ELSE
               outP = outP & tmp2
               tmp2 = tmp1
           END IF
       ELSE
           INCR count
       END IF
   NEXT

   IF count > 1 THEN outP = outP & TRIM$(STR$(count))
   outP = outP & tmp2
   FUNCTION = outP

END FUNCTION

FUNCTION PBMAIN () AS LONG

   DIM initial AS STRING, encoded AS STRING, decoded AS STRING
   initial = INPUTBOX$("Type something.")
   encoded = RLEncode(initial)
   decoded = RLDecode(encoded)
   MSGBOX initial & $CRLF & encoded & $CRLF & decoded

END FUNCTION</lang>

Sample output (last example shows error from numbers in input string):

aaaaeeeeeeiiiioooouuy
4a6e4i4o2uy
aaaaeeeeeeiiiioooouuy

My dog has fleas.
My dog has fleas.
My dog has fleas.

qqq1www2eee3rrr
3q13w23e33r
qqqwwwwwwwwwwwwweeeeeeeeeeeeeeeeeeeeeeerrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr

PowerShell

<lang powershell>function Compress-RLE ($s) {

   $re = [regex] '(.)\1*'
   $ret = ""
   foreach ($m in $re.Matches($s)) {
       $ret += $m.Length
       $ret += $m.Value[0]
   }
   return $ret

}

function Expand-RLE ($s) {

   $re = [regex] '(\d+)(.)'
   $ret = ""
   foreach ($m in $re.Matches($s)) {
       $ret += [string] $m.Groups[2] * [int] [string] $m.Groups[1]
   }
   return $ret

}</lang> Output:

PS> Compress-RLE "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"
12W1B12W3B24W1B14W
PS> Expand-RLE "12W1B12W3B24W1B14W"
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Python

<lang python>def encode(input_string):

   count = 1
   prev = 
   lst = []
   for character in input_string:
       if character != prev:
           if prev:
               entry = (prev,count)
               lst.append(entry)
               #print lst
           count = 1
           prev = character
       else:
           count += 1
   else:
       entry = (character,count)
       lst.append(entry)
   return lst

def decode(lst):

   q = ""
   for character, count in lst:
       q += character * count
   return q

Method call

encode("aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa") decode([('a', 5), ('h', 6), ('m', 7), ('u', 1), ('i', 7), ('a', 6)])</lang>

Functional

Works with: Python version 2.4

<lang python>from itertools import groupby def encode(input_string):

   return [(len(list(g)), k) for k,g in groupby(input_string)]

def decode(lst):

   return .join(c * n for n,c in lst)

encode("aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa") decode([(5, 'a'), (6, 'h'), (7, 'm'), (1, 'u'), (7, 'i'), (6, 'a')])</lang>

By regular expression
The simplified input range of only uppercase characters allows a simple regular expression to be applied repeatedly for encoding, and another for decoding: <lang python>from re import sub

def encode(text):

   
   Doctest:
       >>> encode('WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW')
       '12W1B12W3B24W1B14W'    
   
   return sub(r'(.)\1*', lambda m: str(len(m.group(0))) + m.group(1),
              text)

def decode(text):

   
   Doctest:
       >>> decode('12W1B12W3B24W1B14W')
       'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'
   
   return sub(r'(\d+)(\D)', lambda m: m.group(2) * int(m.group(1)),
              text)

textin = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" assert decode(encode(textin)) == textin</lang>

R

R has a built-in function, rle, for run length encoding. This modification allows input and output in the forms specified above. <lang R> runlengthencoding <- function(x) {

  splitx <- unlist(strsplit(input, ""))
  rlex <- rle(splitx)
  paste(with(rlex, as.vector(rbind(lengths, values))), collapse="")

}

input <- "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" runlengthencoding(input) </lang> Similarly, inverse.rle provides decompression after a run length encoding. <lang R> inverserunlengthencoding <- function(x) {

   lengths <- as.numeric(unlist(strsplit(output, "alpha:")))
   values <- unlist(strsplit(output, "digit:"))
   values <- values[values != ""]
   uncompressed <- inverse.rle(list(lengths=lengths, values=values))
   paste(uncompressed, collapse="")

}

output <- "12W1B12W3B24W1B14W" inverserunlengthencoding(output) </lang>

Ruby

<lang ruby>def encode(string)

 string.scan(/(.)(\1*)/).collect do |matchgroups|
   char, repeat = matchgroups
   [char, 1 + repeat.length] 
 end

end

def decode(encoding)

 encoding.inject("") do |decoding, pair|
   char, length = pair
   decoding << char * length
 end

end

orig = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" enc = encode(orig) # => [["W", 12], ["B", 1], ["W", 12], ["B", 3], ["W", 24], ["B", 1], ["W", 14]] dec = decode(enc) puts "success!" if dec == orig</lang>

This usage also seems to be idiomatic, and perhaps less cryptic: <lang ruby>def encode(string)

 encoding = []
 for char, repeat in string.scan(/(.)(\1*)/)
   encoding << [char, 1 + repeat.length]
 end
 encoding

end

def decode(encoding)

 decoding = ""
 for char, length in encoding
   decoding << char * length
 end
 decoding

end</lang>

Smalltalk

Works with: GNU Smalltalk

The class RunLengthEnc holds a representation of a run length encoded sequence of objects.

<lang smalltalk>Object subclass: RunLengthEnc [

 |counters contents|
 <category: 'Compression'>
 <comment: 'My instances are similar to a Bag, except

that items are ordered and counted iff they are adjacent. So that this instance keeps a representation of the added items suitable for performing a RunLengthEncoding, hence the name.'>

 RunLengthEnc class >> new [ ^self basicNew initialize ]

 size [ ^counters size ]

 add: anObject [ ^(self add: anObject withCount: 1) ]

 add: anObject withCount: anInt [
   anObject isNil
       ifTrue: [ 
           SystemExceptions.InvalidArgument signalOn: anObject
             reason: 'RunLengthEnc encodes existing objects, e.g. integers or characters, not nil' 
       ].
   (self size) > 0
   ifTrue: [
     (contents last) == anObject
         ifTrue: [
            self incrementLastCounter: anInt.
         ]
         ifFalse: [

self appendObject: anObject withCount: anInt

         ]
   ] ifFalse: [ self appendObject: anObject withCount: anInt ].
   ^anObject
 ]

 initialize [
   counters := OrderedCollection new.
   contents := OrderedCollection new.
 ]

 appendObject: anObject withCount: anInt [
   contents addLast: anObject.
   counters addLast: anInt
 ]

 appendObject: anObject [
   contents addLast: anObject.
   counters addLast: 1
 ]

 incrementLastCounter: howMuch [ | c |
   c := counters removeLast.
   counters addLast: (c+howMuch)
 ]

 "the 'do:' can be used to let the user store the compressed 'stream' as s/he
  prefers, while 'add:withCount:' can be used to rebuild the informations from
  the custom storage" 
 do: aBlock [
   1 to: (counters size) do: [ :i | | l |
     aBlock value: (contents at: i) value: (counters at: i)
   ]
 ]

 asOrderedCollection [ |oc|
   oc := OrderedCollection new.
   self do: [ :o :c |
     1 to: c do: [ :i| oc add: o ]
   ].
   ^oc
 ]

 printOn: aStream [
     "output a representation of the object:
      counter [object] ... for every object"
     1 to: (counters size) do: [ :i |
        (counters at: i) printOn: aStream.

aStream nextPut: (Character value: 32). (contents at: i) printOn: aStream. aStream nextPut: (Character nl).

     ]
 ]

 asString [ |oc| 
   "'build' a string from the run length encoded objects;
    works only if objects are Characters or Strings"
   oc := self asOrderedCollection.
   ^(oc asString)
 ]

].</lang>

The following extension to the OrderedCollection class allows to run length encode an ordered collection (theoretically of any objects' kind, but the RunLengthEnc class is supposed to work with characters mainly).

<lang smalltalk>OrderedCollection extend [

  asRunLengthEnc [ |rc|
      rc := RunLengthEnc new.
      self do: [ :o |
         rc add: o
      ].
      ^rc
  ]

].</lang>

The following extention to the String class allows to run length encode a string (it is basically a shortcut for aString asOrderedCollection asRunLengthEnc).

<lang smalltalk>String extend [

  asRunLengthEnc [ ^self asOrderedCollection asRunLengthEnc ]

].</lang>

Usage examples

<lang smalltalk>|cs s os|

s := 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'.

"let us run length encode the string" cs := s asRunLengthEnc. cs print. "this shows an ASCII representation of the run length encoded objects collection;

          in this case, of the string"

"let us show that the class is able to return the string back; this really works

iff the objects of the collection are strings or characters"

cs asString displayNl.</lang>

The class does not mandate a way of storing itself into a file that can be loaded later. The following sample code shows how it could be done quickly, but not efficiently from the point of view of a compressor.

<lang smalltalk>|f| "let's write the object and its informations to a file..." f := FileStream open: 'rledump' mode: FileStream write. ObjectDumper dump: cs to: f. f close.

"... and let's read it back" |o| f := FileStream open: 'rledump' mode: FileStream read. o := ObjectDumper loadFrom: f. o print. "show that the object holds the same informations of cs" f close.</lang>

Tcl

The encoding is an even-length list with elements {count char ...} <lang tcl>proc encode {string} {

   set encoding {}
   # use a regular expression to match runs of one character
   foreach {run -} [regexp -all -inline {(.)\1+|.} $string] {
       lappend encoding [string length $run] [string index $run 0]
   }
   return $encoding

}

proc decode {encoding} {

   foreach {count char} $encoding  {
       append decoded [string repeat $char $count]
   }
   return $decoded

}</lang>

<lang tcl>set str "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" set enc [encode $str] ;# ==> {12 W 1 B 12 W 3 B 24 W 1 B 14 W} set dec [decode $enc] if {$str eq $dec} {

   puts "success"

}</lang>

Ursala

A standard library function, rlc, does most of the work for this task, which is a second order function taking a binary predicate that decides when consecutive items of an input list belong to the same run. <lang Ursala>#import std

import nat

encode = (rlc ==); *= ^lhPrNCT\~&h %nP+ length

decode = (rlc ~&l-=digits); *=zyNCXS ^|DlS/~& iota+ %np

test_data = 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'

show+

example =

<

  encode test_data,
  decode encode test_data></lang>

The output shows an encoding of the test data, and a decoding of the encoding, which matches the original test data.

12W1B12W3B24W1B14W
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW

Vedit macro language

The following example encodes/decodes an entire file. Each run is coded with two bytes. The first byte is the run length with high bit set, the second byte is the character code. ASCII characters with run length of 1 are left unchanged. Character codes above 127 are always coded with run length. Newlines are not converted (the regular expression does not count newlines). This methods supports any type of input. <lang vedit>

RL_ENCODE:

BOF While (!At_EOF) {

   if (At_EOL) { Line(1) Continue }    // skip newlines
   #1 = Cur_Char                       // #1 = character
   Match("(.)\1*", REGEXP)             // count run length
   #2 = Chars_Matched                  // #2 = run length
   if (#2 > 127) { #2 = 127 }          // can be max 127
   if (#2 > 1 || #1 > 127) {
       Del_Char(#2)
       Ins_Char(#2 | 128)              // run length (high bit set)
       Ins_Char(#1)                    // character
   } else {                            // single ASCII char
       Char                            // skip
   }

} Return

RL_DECODE:

BOF While (!At_EOF) {

   #2 = Cur_Char
   if (#2 > 127) {                     // is this run length?
       #1 = Cur_Char(1)                // #1 = character value
       Del_Char(2)
       Ins_Char(#1, COUNT, #2 & 127)
   } else {                            // single ASCII char
       Char
   }

} Return </lang>