Run-length encoding: Difference between revisions
m (→{{header|Java}}: Changed to static methods, showed tests that were previously commented) |
m (→{{header|Java}}: Should have been a libheader) |
||
Line 619: | Line 619: | ||
Tests: |
Tests: |
||
{{ |
{{libheader|JUnit}} |
||
<lang java>import static org.junit.Assert.assertEquals; |
<lang java>import static org.junit.Assert.assertEquals; |
||
Revision as of 17:58, 21 October 2009
This page uses content from Wikipedia. The original article was at Run-length_encoding. The list of authors can be seen in the page history. As with Rosetta Code, the text of Wikipedia is available under the GNU FDL. (See links for details on variance) |
You are encouraged to solve this task according to the task description, using any language you may know.
Given a string containing uppercase characters (A-Z), compress repeated 'runs' of the same character by storing the length of that run, and provide a function to reverse the compression. The output can be anything, as long as you can recreate the input with it.
Example:
- Input:
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
- Output:
12W1B12W3B24W1B14W
See also the Look-and-say sequence.
Ada
<lang Ada> with Ada.Text_IO; use Ada.Text_IO; with Ada.Strings.Fixed; use Ada.Strings.Fixed; procedure Test_Run_Length_Encoding is
function Encode (Data : String) return String is begin if Data'Length = 0 then return ""; else declare Code : constant Character := Data (Data'First); Index : Integer := Data'First + 1; begin while Index <= Data'Last and then Code = Data (Index) loop Index := Index + 1; end loop; declare Prefix : constant String := Integer'Image (Index - Data'First); begin return Prefix (2..Prefix'Last) & Code & Encode (Data (Index..Data'Last)); end; end; end if; end Encode; function Decode (Data : String) return String is begin if Data'Length = 0 then return ""; else declare Index : Integer := Data'First; Count : Natural := 0; begin while Index < Data'Last and then Data (Index) in '0'..'9' loop Count := Count * 10 + Character'Pos (Data (Index)) - Character'Pos ('0'); Index := Index + 1; end loop; if Index > Data'First then return Count * Data (Index) & Decode (Data (Index + 1..Data'Last)); else return Data; end if; end; end if; end Decode;
begin
Put_Line (Encode ("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW")); Put_Line (Decode ("12W1B12W3B24W1B14W"));
end Test_Run_Length_Encoding; </lang> Sample output:
12W1B12W3B24W1B14W WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
AWK
It works with "textual" input. Lines containing numbers are skipped, since they can't be represented in a not ambiguous way in this implementation (e.g. "11AA" would be encoded as "212A", which would be decoded as A repeated 212 times!)
Encoding
<lang awk>BEGIN {
FS=""
} /^[^0-9]+$/ {
cp = $1; j = 0 for(i=1; i <= NF; i++) { if ( $i == cp ) { j++; } else { printf("%d%c", j, cp) j = 1 } cp = $i } printf("%d%c", j, cp)
}</lang>
Decoding
<lang awk>BEGIN {
RS="[0-9]+[^0-9]" final = "";
} {
match(RT, /([0-9]+)([^0-9])/, r) for(i=0; i < int(r[1]); i++) { final = final r[2] }
} END {
print final
}</lang>
ALGOL 68
Note: The following uses iterators, eliminating the need of declaring arbitrarily large CHAR arrays for caching. <lang algol>STRING input := "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"; STRING output := "12W1B12W3B24W1B14W";
MODE ITERATOR = VOID; MODE YIELDCHAR = PROC(CHAR)ITERATOR; MODE ITERCHAR = PROC(YIELDCHAR)ITERATOR;
PROC char seq = (REF STRING s, YIELDCHAR yield)ITERATOR:
FOR i FROM LWB s TO UPB s DO yield(s[i]) OD;
- Note: The following 2 lines use currying. This not supported by ELLA ALGOL 68RS #
ITERCHAR input seq = char seq(input,),
output seq = char seq(output,);
PROC for encode = (ITERCHAR for seq, YIELDCHAR yield)ITERATOR: (
INT count := 0; CHAR prev;
- FOR c IN seq DO #
for seq((CHAR c)VOID: ( IF count = 0 THEN count := 1; prev := c ELIF c /= prev THEN STRING str count := whole(count,0); char seq(str count, yield); count := 1; yield(prev); prev := c ELSE count +:=1 FI ) )
- OD #;
IF count /= 0 THEN STRING str count := whole(count,0); char seq(str count,yield); yield(prev) FI
);
STRING zero2nine = "0123456789";
PROC for decode = (ITERCHAR for seq, YIELDCHAR yield)ITERATOR: (
INT repeat := 0;
- FOR c IN seq DO #
for seq((CHAR c)VOID: ( IF char in string(c, NIL, zero2nine) THEN repeat := repeat*10 + ABS c - ABS "0" ELSE FOR i TO repeat DO yield(c) OD; repeat := 0 FI ))
);
- iterate through input string #
print("Encode input: ");
- FOR c IN encode(input seq) DO #
for encode(input seq, (CHAR c)VOID: print(c) )
- OD #;
print(new line);
print("Decode output: ");
- FOR c IN decode(output seq) DO #
for decode(output seq, (CHAR c)VOID: print(c) )
- OD #;
print(new line)</lang>
Encode input: 12W1B12W3B24W1B14W Decode output: WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
AutoHotkey
<lang AutoHotkey> MsgBox % key := rle_encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") MsgBox % rle_decode(key)
rle_encode(message) {
StringLeft, previous, message, 1 StringRight, last, message, 1 message .= Asc(Chr(last)+1) count = 0 Loop, Parse, message { If (previous == A_LoopField) count +=1 Else { output .= previous . count previous := A_LoopField count = 1 } } Return output
}
rle_decode(message) {
pos = 1 While, item := RegExMatch(message, "\D", char, pos) { digpos := RegExMatch(message, "\d+", dig, item) Loop, % dig output .= char pos := digpos } Return output
} </lang>
C
These functions have no check for the size of the output buffers.
Encoding function
Since repeat counter must fit a single byte in this implementation, it can't be greater than 255, so a byte repeated more than 255 times generates in the compressed stream more than 2 bytes (4 bytes if the length of the repeated byte sequence is less than 511 and so on)
<lang c>int rle_encode(char *out, const char *in, int l) {
int dl, i; char cp, c;
for(cp=c= *in++, i = 0, dl=0; l>0 ; c = *in++, l-- ) { if ( c == cp ) { i++; if ( i > 255 ) {
*out++ = 255; *out++ = c; dl += 2; i = 1;
} } else { *out++ = i; *out++ = cp; dl += 2; i = 1; } cp = c; } *out++ = i; *out++ = cp; dl += 2; return dl;
}</lang>
Decoding function
<lang c>int rle_decode(char *out, const char *in, int l) {
int i, j, tb; char c;
for(tb=0 ; l>=0 ; l -= 2 ) { i = *in++; c = *in++; tb += i; while(i-- > 0) *out++ = c; } return tb;
}</lang>
Usage example
<lang c>#include <stdio.h>
- include <stdlib.h>
- include <string.h>
const char *o = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";
int main() {
char *d = malloc(2*strlen(o)); char *oc = malloc(strlen(o)); int rl = rle_encode(d, o, strlen(o)); /* fwrite(d, 1, rl, stdout); */
int ocl = rle_decode(oc, d, rl); fwrite(oc, 1, ocl, stdout);
free(d); free(oc); return 0;
}</lang>
In the following codes, encoding and decoding are implementend as "filters" which compress/decompress standard input to standard output writing ASCII strings; they will work as long as the input has no ASCII digits in it, and the compressed/original ratio of a "single group" will be less than or equal to 1 as long as the ASCII decimal representation's length of the repeat counter will be shorter than the length of the "group". It should be so except in the case the group is a single isolated character, e.g. B gives 1B (one byte against two compressed bytes)
Encoding filter
<lang c>#include <stdio.h>
int main() {
int i, c, cp;
for(cp=c=getchar(), i = 0; c != EOF; c = getchar() ) { if ( c == cp ) { i++; } else { printf("%d%c", i, cp); i = 1; } cp = c; } printf("%d%c", i, cp);
}</lang>
Decoding filter
<lang c>#include <stdio.h>
int main() {
int i, c, j;
while( scanf("%d%c", &i, &c) == 2 ) { for(j=0; j < i; j++) printf("%c", c); }
}</lang>
Final note: since the repeat counter value 0 has no meaning, it could be used as it would be 256, so extending by one the maximum number of repetitions representable with a single byte; or instead it could be used as a special marker to encode in a more efficient way (long) sequences of isolated characters, e.g. "ABCDE" would be encoded as "1A1B1C1D1E"; it could be instead encoded as "05ABCDE".
Clojure
<lang clojure> (defn step [[result, n, c], new] ;function used in encode's reduce call
(cond (zero? n) [result, 1, new] (= c new) [result, (inc n), c] :else [(conj result [n c]), 1, new]))
(defn encode [s]
(let [[result,n,chr] (reduce step [[],0,nil] s)] (if (= n 0) result (conj result [n chr]))))
(defn decode [v]
(let [expand (fn n c (repeat n c))] (apply str (mapcat expand v))))
</lang> <lang clojure> (def uncoded "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") (def coded [[12 \W] [1 \B] [12 \W] [3 \B] [24 \W] [1 \B] [14 \W]])
(assert (= (encode uncoded) coded)) (assert (= (decode coded) uncoded)) </lang>
Common Lisp
<lang lisp>(defun group-similar (sequence &key (test 'eql))
(loop for x in (rest sequence) with temp = (subseq sequence 0 1) if (funcall test (first temp) x) do (push x temp) else collect temp and do (setf temp (list x))))
(defun run-length-encode (sequence)
(mapcar (lambda (group) (list (first group) (length group))) (group-similar (coerce sequence 'list))))
(defun run-length-decode (sequence)
(reduce (lambda (s1 s2) (concatenate 'simple-string s1 s2)) (mapcar (lambda (elem) (make-string (second elem) :initial-element (first elem))) sequence)))
(run-length-encode "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW") (run-length-decode '((#\W 12) (#\B 1) (#\W 12) (#\B 3) (#\W 24) (#\B 1)))</lang>
D
<lang d> import std.stdio; import std.string; void main() {
char[]rle = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"; char[]encoded = encode(rle); char[]decoded = decode(encoded); writefln("\"%s\" == \"%s\", intermediary %s",rle,decoded,encoded); assert(rle == decoded);
}
// this is essentially an exact copy of the look and say D function char[]encode(char[]input) {
char last = input[$-1]; char[]output; int count = 0; foreach_reverse(i;input) { if (i == last) { count++; } else { output = toString(count)~last~output; count = 1; last = i; } } output = toString(count)~last~output; return output;
}
char[]decode(char[]input) {
char[]i = ""; char[]ret; foreach(letter;input) { if (letter <= 'Z' && letter >= 'A') { // this is the letter to be repeated if (!i.length) throw new Exception("Can not repeat a letter without a number of repetitions"); ret ~= repeat([letter],atoi(i)); i = null; } else if (letter <= '9' && letter >= '0') { // this is a digit to mark the number of repetitions i ~= letter; } else { throw new Exception("'"~letter~"' is not capalphanumeric"); } } return ret;
} </lang>
E
<lang e>def rle(string) {
var seen := null var count := 0 var result := [] def put() { if (seen != null) { result with= [count, seen] } } for ch in string { if (ch != seen) { put() seen := ch count := 0 } count += 1 } put() return result
}
def unrle(coded) {
var result := "" for [count, ch] in coded { result += E.toString(ch) * count } return result
}</lang>
<lang e>? rle("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW")
- value: [[12, 'W'], [1, 'B'], [12, 'W'], [3, 'B'], [24, 'W'], [1, 'B'], [14, 'W']]
? unrle(rle("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"))
- value: "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"</lang>
FALSE
<lang false> 1^[^$~][$@$@=$[%%\1+\$0~]?~[@.,1\$]?%]#%\., {encode} </lang> <lang false> [0[^$$'9>'0@>|~]['0-\10*+]#]n: [n;!$~][[\$][1-\$,]#%%]#%% {decode} </lang>
Fan
<lang Fan>
- Generates a run-length encoding for a string
class RLE {
Run[] encode(Str s) { runs := Run[,]
s.size.times |i| { ch := s[i] if (runs.size==0 || runs.last.char != ch) runs.add(Run(ch)) runs.last.inc } return runs }
Str decode(Run[] runs) { buf := StrBuf() runs.each |run| { run.count.times { buf.add(run.char.toChar) } } return buf.toStr }
Void main() { echo(decode(encode(
"WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"
))) }
}
internal class Run {
Int char Int count := 0 new make(Int ch) { char = ch } Void inc() { ++count }
override Str toStr() { return "${count}${char.toChar}" }
} </lang>
Haskell
<lang haskell>import Data.List (group)
-- Datatypes type Encoded = [(Int, Char)] -- An encoded String with form [(times, char), ...] type Decoded = String
-- Takes a decoded string and returns an encoded list of tuples rlencode :: Decoded -> Encoded rlencode = map (\g -> (length g, head g)) . group
-- Takes an encoded list of tuples and returns the associated decoded String rldecode :: Encoded -> Decoded rldecode = concatMap decodeTuple
where decodeTuple (n,c) = replicate n c
main :: IO () main = do
-- Get input putStr "String to encode: " input <- getLine -- Output encoded and decoded versions of input let encoded = rlencode input decoded = rldecode encoded putStrLn $ "Encoded: " ++ show encoded ++ "\nDecoded: " ++ show decoded</lang>
J
<lang j>
NB. Run-length encode and decode... rle=: ([: ; (":@# , {.)&.>)@((1 , }. ~: }:) <;.1 ]) rle 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'
12W1B12W3B24W1B14W
rld=. '0123456789'&((i. ".@:{ ' ' ,~ [) # -.@:e.~ # ]) rld'12W1B12W3B24W1B14W'
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW </lang>
Java
<lang java>import java.util.regex.Matcher; import java.util.regex.Pattern; public class RunLengthEncoding {
public static String encode(String source) { StringBuffer dest = new StringBuffer(); for (int i = 0; i < source.length(); i++) { int runLength = 1; while (i+1 < source.length() && source.charAt(i) == source.charAt(i+1)) { runLength++; i++; } dest.append(runLength); dest.append(source.charAt(i)); } return dest.toString(); }
public static String decode(String source) { StringBuffer dest = new StringBuffer(); Pattern pattern = Pattern.compile("[0-9]+|[a-zA-Z]"); Matcher matcher = pattern.matcher(source); while (matcher.find()) { int number = Integer.parseInt(matcher.group()); matcher.find(); while (number-- != 0) { dest.append(matcher.group()); } } return dest.toString(); }
public static void main(String[] args) { String example = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"; System.out.println(encode(example)); System.out.println(decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B")); }
}</lang> Tests:
<lang java>import static org.junit.Assert.assertEquals;
import org.junit.Test;
public class RunLengthEncodingTest { private RLE = new RunLengthEncoding();
@Test public void encodingTest() { assertEquals("1W", RLE.encode("W")); assertEquals("4W", RLE.encode("WWWW")); assertEquals("5w4i7k3i6p5e4d2i1a", RLE.encode("wwwwwiiiikkkkkkkiiippppppeeeeeddddiia")); assertEquals("12B1N12B3N24B1N14B", RLE.encode("BBBBBBBBBBBBNBBBBBBBBBBBBNNNBBBBBBBBBBBBBBBBBBBBBBBBNBBBBBBBBBBBBBB")); assertEquals("12W1B12W3B24W1B14W", RLE.encode("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW")); assertEquals("1W1B1W1B1W1B1W1B1W1B1W1B1W1B", RLE.encode("WBWBWBWBWBWBWB"));
}
@Test public void decodingTest() { assertEquals("W", RLE.decode("1W")); assertEquals("WWWW", RLE.decode("4W")); assertEquals("WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW", RLE.decode("12W1B12W3B24W1B14W")); assertEquals("WBWBWBWBWBWBWB", RLE.decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B")); assertEquals("WBWBWBWBWBWBWB", RLE.decode("1W1B1W1B1W1B1W1B1W1B1W1B1W1B"));
} }</lang>
JavaScript
Here's an encoding method that walks the input string character by character <lang javascript>function encode(input) {
var encoding = []; var prev, count, i; for (count = 1, prev = input[0], i = 1; i < input.length; i++) { if (input[i] != prev) { encoding.push([count, prev]); count = 1; prev = input[i]; } else count ++; } encoding.push([count, prev]); return encoding;
}</lang>
Here's an encoding method that uses a regular expression to grab the character runs (
for the forEach
method)
<lang javascript>function encode_re(input) {
var encoding = []; input.match(/(.)\1*/g).forEach(function(substr){ encoding.push([substr.length, substr[0]]) }); return encoding;
}</lang>
And to decode (see Repeating a string) <lang javascript>function decode(encoded) {
var output = ""; encoded.forEach(function(pair){ output += new Array(1+pair[0]).join(pair[1]) }) return output;
}</lang>
Mathematica
Custom functions using Map, Apply, pure functions, replacing using pattern matching, delayed rules and other functions: <lang Mathematica>
RunLengthEncode[input_String]:=StringJoin@@Sequence@@@({ToString @Length[#],First[#]}&/@Split[Characters[input]]) RunLengthDecode[input_String]:=StringJoin@@ConstantArray@@@Reverse/@Partition[(Characters[input]/.(ToString[#]->#&/@Range[0,9]))//.{x___,i_Integer,j_Integer,y___}:>{x,10i+j,y},2]
</lang> Example: <lang Mathematica>
mystring="WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW"; RunLengthEncode[mystring] RunLengthDecode[%] %==mystring
</lang> gives back: <lang Mathematica>
12W1B12W3B24W1B14W WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW True
</lang>
Objective-C
The class RCRunLengthEncoder represents internally data with which it was feeded as pair character - repetition counter: it does not implement a binary representation of itself; it is left to another class, so that different input/output encodings are possible starting from the same class.
<lang objc>#import <Foundation/Foundation.h>
@interface RCRunLengthEncoder : NSObject {
NSMutableArray *contents; NSMutableArray *counters;
} + (id)alloc; + (id)encoderWithData: (NSData *)data; - (id)init; - (id)initWithData: (NSData *)data; - (void)dealloc; - (void)addByte: (char)aByte; - (void)addByte: (char)aByte repeated: (int)repetitionCount; - (void)addData: (NSData *)data; - (NSData *)data; - (NSArray *)counters; - (NSArray *)contents; @end
@implementation RCRunLengthEncoder
+ (id)encoderWithData: (NSData *)data
{
return [[[self alloc] initWithData: data] autorelease];
}
- (id)initWithData: (NSData *)data {
if ((self = [self init]) != nil) { [self addData: data]; } return self;
}
- (id)init {
if ((self = [super init]) != nil) { contents = [[NSMutableArray alloc] init]; counters = [[NSMutableArray alloc] init]; } return self;
}
- (void)dealloc {
[counters release]; [contents release]; [super dealloc];
}
- (void)addByte: (char)aByte {
[self addByte: aByte repeated: 1];
}
- (void)addByte: (char)aByte repeated: (int)repetitionCount {
if ( ([contents count] == 0) || ([[contents lastObject] charValue] != aByte) ) { [contents addObject: [NSNumber numberWithChar: aByte]]; [counters addObject: [NSNumber numberWithInt: repetitionCount]]; } else { NSNumber *a = [counters lastObject]; [counters removeLastObject]; [counters addObject: [NSNumber numberWithInt:
([a intValue] + repetitionCount) ]];
}
}
- (void)addData: (NSData *)data {
const char *d = [data bytes]; NSUInteger i; for(i=0; i < [data length]; i++) [self addByte: d[i]];
}
- (NSArray *)contents {
return contents;
}
- (NSArray *)counters {
return counters;
}
- (NSData *)data {
NSMutableData *d = [[NSMutableData alloc] initWithCapacity: 256]; char buf[256]; int i;
for(i=0; i < [contents count]; i++) { char c = [[contents objectAtIndex: i] charValue]; int n = [[counters objectAtIndex: i] intValue]; memset(buf, c, 256); while ( n > 0 ) { [d appendBytes: buf length: MIN(256, n)]; n -= 256; } } return [d autorelease];
} @end</lang>
The class codecRLE is derived from the previous, adding the methods that allow to binary encode the data internally held, and to create a internal representation from the encoded data. The specification here used are:
- byte N >= 128, then the next byte must be repeated N-128 times (if N-128 is 0, the the byte must be repeated 128 times)
- byte N < 128, then the next N bytes must be taken as they are (if N is 0, the next 128 bytes are literal)
<lang objc>@interface codecRLE : RCRunLengthEncoder - (NSData *)encode; - (void)decode: (NSData *)enc; @end
@implementation codecRLE - (void)decode: (NSData *)enc {
const char *buf = [enc bytes]; int i;
for(i = 0; i < [enc length]; ) { if ( (buf[i] & 0x80) != 0) { [self addByte: buf[i+1] repeated: ( ((buf[i]&0x7f) == 0) ? 128 : (buf[i]&0x7f) ) ]; i += 2; } else { int rc = (buf[i] == 0) ? 128 : buf[i]; [self addData: [NSData dataWithBytes: &buf[i+1] length: rc]]; i += rc + 1; } }
}
- (NSData *)encode {
int literalCount=0; int i; char buf[129];
NSMutableData *r = [[NSMutableData alloc] initWithCapacity: 256];
for(i=0; i < [counters count]; i++) { char c = [[contents objectAtIndex: i] charValue]; int howMany = [[counters objectAtIndex: i] intValue]; if ( literalCount == 128 ) { buf[0] = 0; [r appendBytes: buf length: 129]; literalCount = 0; } if ( howMany == 1 ) { buf[literalCount+1] = c; literalCount++; } else { if ( literalCount > 0 ) {
buf[0] = literalCount & 0x7f; [r appendBytes: buf length: (literalCount+1) ]; literalCount = 0;
} buf[1] = c; while( howMany > 127 ) {
buf[0] = 0x80; [r appendBytes: buf length: 2]; howMany -= 128;
} if (howMany > 0) {
buf[0] = howMany | 0x80; [r appendBytes: buf length: 2];
} } } return [r autorelease];
} @end</lang>
Usage example:
<lang objc>const char *s = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW";
int main() {
NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
codecRLE *enc = [[codecRLE alloc]
initWithData: [NSData dataWithBytes: s length: strlen(s)] ];
NSData *repr = [enc encode]; fwrite([repr bytes], 1, [repr length], stdout); [enc release];
enc = [[codecRLE alloc] init]; [enc decode: repr]; NSData *d = [enc data]; fwrite([d bytes], 1, [d length], stdout); [enc release];
[pool release]; return EXIT_SUCCESS;
}</lang>
Notes
- The code is not deeply tested yet
OCaml
<lang ocaml>let encode str =
let len = String.length str in let rec aux i acc = if i >= len then List.rev acc else let c1 = str.[i] in let rec aux2 j = if j >= len then (c1, j-i) else let c2 = str.[j] in if c1 = c2 then aux2 (j+1) else (c1, j-i) in let (c,n) as t = aux2 (i+1) in aux (i+n) (t::acc) in aux 0 []
let decode lst =
let l = List.map (fun (c,n) -> String.make n c) lst in (String.concat "" l)</lang>
<lang ocaml>let () =
let e = encode "aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa" in List.iter (fun (c,n) -> Printf.printf " (%c, %d);\n" c n; ) e; print_endline (decode [('a', 5); ('h', 6); ('m', 7); ('u', 1); ('i', 7); ('a', 6)]);
- </lang>
Perl
<lang perl>sub encode
{my $str = shift; $str =~ s {(.)(\1*)} {(length($2) + 1) . $1 . ';'}gse; return $str;}
sub decode
{my $str = shift; $str =~ s {(\d+)(.);} {$2 x $1}gse; return $str;}</lang>
The following modified versions of the previous one, encode/decode a bytes sequence in a way compatible with the functions of the C version.
<lang perl>sub encode
{my $str = shift; $str =~ s {(.)(\1{0,254})} {pack("C",(length($2) + 1)) . $1 }gse; return $str;}
sub decode {
my @str = split //, shift; my $r = ""; foreach my $i (0 .. scalar(@str)/2-1) {
$r .= $str[2*$i + 1] x unpack("C", $str[2*$i]);
} return $r;
}</lang>
PowerBASIC
In this example, a character that only appears once isn't given a number in the encoded sequence. This prevents the code from getting longer. (A string without any runs is returned unchanged.) This version can also handle any arbitrary string that doesn't contain numbers (not just letters). (A flag value could be added which would allow the inclusion of any character, but such a flag isn't in this example.)
<lang powerbasic>FUNCTION RLDecode (i AS STRING) AS STRING
DIM Loop0 AS LONG, count AS STRING, outP AS STRING, m AS STRING
FOR Loop0 = 1 TO LEN(i) m = MID$(i, Loop0, 1) SELECT CASE m CASE "0" TO "9" count = count & m CASE ELSE IF LEN(count) THEN outP = outP & STRING$(VAL(count), m) count="" ELSE outP = outP & m END IF END SELECT NEXT FUNCTION = outP
END FUNCTION
FUNCTION RLEncode (i AS STRING) AS STRING
DIM tmp1 AS STRING, tmp2 AS STRING, outP AS STRING DIM Loop0 AS LONG, count AS LONG
FOR Loop0 = 1 TO LEN(i) tmp1 = MID$(i, Loop0, 1) IF tmp1 <> tmp2 THEN IF count > 1 THEN outP = outP & TRIM$(STR$(count)) & tmp2 tmp2 = tmp1 count = 1 ELSEIF 0 = count THEN tmp2 = tmp1 count = 1 ELSE outP = outP & tmp2 tmp2 = tmp1 END IF ELSE INCR count END IF NEXT
IF count > 1 THEN outP = outP & TRIM$(STR$(count)) outP = outP & tmp2 FUNCTION = outP
END FUNCTION
FUNCTION PBMAIN () AS LONG
DIM initial AS STRING, encoded AS STRING, decoded AS STRING initial = INPUTBOX$("Type something.") encoded = RLEncode(initial) decoded = RLDecode(encoded) MSGBOX initial & $CRLF & encoded & $CRLF & decoded
END FUNCTION</lang>
Sample output (last example shows error from numbers in input string):
aaaaeeeeeeiiiioooouuy 4a6e4i4o2uy aaaaeeeeeeiiiioooouuy My dog has fleas. My dog has fleas. My dog has fleas. qqq1www2eee3rrr 3q13w23e33r qqqwwwwwwwwwwwwweeeeeeeeeeeeeeeeeeeeeeerrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr
PowerShell
<lang powershell>function Compress-RLE ($s) {
$re = [regex] '(.)\1*' $ret = "" foreach ($m in $re.Matches($s)) { $ret += $m.Length $ret += $m.Value[0] } return $ret
}
function Expand-RLE ($s) {
$re = [regex] '(\d+)(.)' $ret = "" foreach ($m in $re.Matches($s)) { $ret += [string] $m.Groups[2] * [int] [string] $m.Groups[1] } return $ret
}</lang> Output:
PS> Compress-RLE "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" 12W1B12W3B24W1B14W PS> Expand-RLE "12W1B12W3B24W1B14W" WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Python
<lang python>def encode(input_string):
count = 1 prev = lst = [] for character in input_string: if character != prev: if prev: entry = (prev,count) lst.append(entry) #print lst count = 1 prev = character else: count += 1 else: entry = (character,count) lst.append(entry) return lst
def decode(lst):
q = "" for character, count in lst: q += character * count return q
- Method call
encode("aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa") decode([('a', 5), ('h', 6), ('m', 7), ('u', 1), ('i', 7), ('a', 6)])</lang>
Functional
<lang python>from itertools import groupby def encode(input_string):
return [(len(list(g)), k) for k,g in groupby(input_string)]
def decode(lst):
return .join(c * n for n,c in lst)
encode("aaaaahhhhhhmmmmmmmuiiiiiiiaaaaaa") decode([(5, 'a'), (6, 'h'), (7, 'm'), (1, 'u'), (7, 'i'), (6, 'a')])</lang>
By regular expression
The simplified input range of only uppercase characters allows a simple regular expression to be applied repeatedly for encoding, and another for decoding:
<lang python>from re import finditer
def encode(text):
Doctest: >>> encode('WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW') '12W1B12W3B24W1B14W' return .join( str(len(run.group(0))) + run.group(1) for run in finditer(r'(.)\1*', text) )
def decode(text):
Doctest: >>> decode('12W1B12W3B24W1B14W') 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW' return .join( run.group(2) * int(run.group(1)) for run in finditer(r'(\d+)(\D)', text) )
textin = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" assert decode(encode(textin)) == textin</lang>
R
R has a built-in function, rle, for run length encoding. This modification allows input and output in the forms specified above. <lang R> runlengthencoding <- function(x) {
splitx <- unlist(strsplit(input, "")) rlex <- rle(splitx) paste(with(rlex, as.vector(rbind(lengths, values))), collapse="")
}
input <- "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" runlengthencoding(input) </lang> Similarly, inverse.rle provides decompression after a run length encoding. <lang R> inverserunlengthencoding <- function(x) {
lengths <- as.numeric(unlist(strsplit(output, "alpha:"))) values <- unlist(strsplit(output, "digit:")) values <- values[values != ""] uncompressed <- inverse.rle(list(lengths=lengths, values=values)) paste(uncompressed, collapse="")
}
output <- "12W1B12W3B24W1B14W" inverserunlengthencoding(output) </lang>
Ruby
<lang ruby>def encode(string)
string.scan(/(.)(\1*)/).collect do |matchgroups| char, repeat = matchgroups [char, 1 + repeat.length] end
end
def decode(encoding)
encoding.inject("") do |decoding, pair| char, length = pair decoding << char * length end
end
orig = "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" enc = encode(orig) # => [["W", 12], ["B", 1], ["W", 12], ["B", 3], ["W", 24], ["B", 1], ["W", 14]] dec = decode(enc) puts "success!" if dec == orig</lang>
This usage also seems to be idiomatic, and perhaps less cryptic: <lang ruby>def encode(string)
encoding = [] for char, repeat in string.scan(/(.)(\1*)/) encoding << [char, 1 + repeat.length] end encoding
end
def decode(encoding)
decoding = "" for char, length in encoding decoding << char * length end decoding
end</lang>
Smalltalk
The class RunLengthEnc
holds a representation of a run length encoded sequence of objects.
<lang smalltalk>Object subclass: RunLengthEnc [
|counters contents| <category: 'Compression'> <comment: 'My instances are similar to a Bag, except
that items are ordered and counted iff they are adjacent. So that this instance keeps a representation of the added items suitable for performing a RunLengthEncoding, hence the name.'>
RunLengthEnc class >> new [ ^self basicNew initialize ]
size [ ^counters size ]
add: anObject [ ^(self add: anObject withCount: 1) ]
add: anObject withCount: anInt [ anObject isNil ifTrue: [ SystemExceptions.InvalidArgument signalOn: anObject reason: 'RunLengthEnc encodes existing objects, e.g. integers or characters, not nil' ]. (self size) > 0 ifTrue: [ (contents last) == anObject ifTrue: [ self incrementLastCounter: anInt. ] ifFalse: [
self appendObject: anObject withCount: anInt
] ] ifFalse: [ self appendObject: anObject withCount: anInt ]. ^anObject ]
initialize [ counters := OrderedCollection new. contents := OrderedCollection new. ]
appendObject: anObject withCount: anInt [ contents addLast: anObject. counters addLast: anInt ]
appendObject: anObject [ contents addLast: anObject. counters addLast: 1 ]
incrementLastCounter: howMuch [ | c | c := counters removeLast. counters addLast: (c+howMuch) ]
"the 'do:' can be used to let the user store the compressed 'stream' as s/he prefers, while 'add:withCount:' can be used to rebuild the informations from the custom storage" do: aBlock [ 1 to: (counters size) do: [ :i | | l | aBlock value: (contents at: i) value: (counters at: i) ] ]
asOrderedCollection [ |oc| oc := OrderedCollection new. self do: [ :o :c | 1 to: c do: [ :i| oc add: o ] ]. ^oc ]
printOn: aStream [ "output a representation of the object: counter [object] ... for every object" 1 to: (counters size) do: [ :i | (counters at: i) printOn: aStream.
aStream nextPut: (Character value: 32). (contents at: i) printOn: aStream. aStream nextPut: (Character nl).
] ]
asString [ |oc| "'build' a string from the run length encoded objects; works only if objects are Characters or Strings" oc := self asOrderedCollection. ^(oc asString) ]
].</lang>
The following extension to the OrderedCollection class allows to run length encode an ordered collection (theoretically of any objects' kind, but the RunLengthEnc class is supposed to work with characters mainly).
<lang smalltalk>OrderedCollection extend [
asRunLengthEnc [ |rc| rc := RunLengthEnc new. self do: [ :o | rc add: o ]. ^rc ]
].</lang>
The following extention to the String class allows to run length encode a string (it is basically a shortcut for aString asOrderedCollection asRunLengthEnc
).
<lang smalltalk>String extend [
asRunLengthEnc [ ^self asOrderedCollection asRunLengthEnc ]
].</lang>
Usage examples
<lang smalltalk>|cs s os|
s := 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'.
"let us run length encode the string" cs := s asRunLengthEnc. cs print. "this shows an ASCII representation of the run length encoded objects collection;
in this case, of the string"
"let us show that the class is able to return the string back; this really works
iff the objects of the collection are strings or characters"
cs asString displayNl.</lang>
The class does not mandate a way of storing itself into a file that can be loaded later. The following sample code shows how it could be done quickly, but not efficiently from the point of view of a compressor.
<lang smalltalk>|f| "let's write the object and its informations to a file..." f := FileStream open: 'rledump' mode: FileStream write. ObjectDumper dump: cs to: f. f close.
"... and let's read it back" |o| f := FileStream open: 'rledump' mode: FileStream read. o := ObjectDumper loadFrom: f. o print. "show that the object holds the same informations of cs" f close.</lang>
Tcl
The encoding is an even-length list with elements {count char ...} <lang tcl>proc encode {string} {
set encoding {} # use a regular expression to match runs of one character foreach {run -} [regexp -all -inline {(.)\1+|.} $string] { lappend encoding [string length $run] [string index $run 0] } return $encoding
}
proc decode {encoding} {
foreach {count char} $encoding { append decoded [string repeat $char $count] } return $decoded
}</lang>
<lang tcl>set str "WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW" set enc [encode $str] ;# ==> {12 W 1 B 12 W 3 B 24 W 1 B 14 W} set dec [decode $enc] if {$str eq $dec} {
puts "success"
}</lang>
Ursala
A standard library function, rlc, does most of the work for this task, which is a second order function taking a binary predicate that decides when consecutive items of an input list belong to the same run. <lang Ursala>#import std
- import nat
encode = (rlc ==); *= ^lhPrNCT\~&h %nP+ length
decode = (rlc ~&l-=digits); *=zyNCXS ^|DlS/~& iota+ %np
test_data = 'WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW'
- show+
example =
<
encode test_data, decode encode test_data></lang>
The output shows an encoding of the test data, and a decoding of the encoding, which matches the original test data.
12W1B12W3B24W1B14W WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWWWWWWWWWWWBWWWWWWWWWWWWWW
Vedit macro language
The following example encodes/decodes an entire file. Each run is coded with two bytes. The first byte is the run length with high bit set, the second byte is the character code. ASCII characters with run length of 1 are left unchanged. Character codes above 127 are always coded with run length. Newlines are not converted (the regular expression does not count newlines). This methods supports any type of input. <lang vedit>
- RL_ENCODE:
BOF While (!At_EOF) {
if (At_EOL) { Line(1) Continue } // skip newlines #1 = Cur_Char // #1 = character Match("(.)\1*", REGEXP) // count run length #2 = Chars_Matched // #2 = run length if (#2 > 127) { #2 = 127 } // can be max 127 if (#2 > 1 || #1 > 127) { Del_Char(#2) Ins_Char(#2 | 128) // run length (high bit set) Ins_Char(#1) // character } else { // single ASCII char Char // skip }
} Return
- RL_DECODE:
BOF While (!At_EOF) {
#2 = Cur_Char if (#2 > 127) { // is this run length? #1 = Cur_Char(1) // #1 = character value Del_Char(2) Ins_Char(#1, COUNT, #2 & 127) } else { // single ASCII char Char }
} Return </lang>