Huffman coding: Difference between revisions

From Rosetta Code
m (→‎{{header|C}}: now it compiles with gcc -pedantic -std=c99)
(Replaced original with true Huffman task)
{{task}}
Huffman encoding is a way to assign binary codes to symbols that reduces the overall number of bits used to encode a typical string of those symbols.
{{incorrect||This is not Huffman coding for a general set of frequencies.}}
[[wp:Huffman coding|Huffman coding]] is a way of reducing the number of bits it takes to represent a string of symbols. It is a lossless compression algorithm. In order to generate Huffman codes you need to have a list of symbols sorted by their frequency. Say you have these symbols and frequencies:
<pre>A 50%
B 25%
C 12.5%
D 12.5%</pre>
Right away you might think that the best way to encode these symbols is with two binary bits (since there are only four), but because the frequencies of the characters differ so much, Huffman codes can save some space.
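The saving can be checked with a line of arithmetic (not part of the original page): with the codes derived below (0, 10, 110, 111), the expected code length is 1.75 bits per symbol, versus 2 bits for a fixed-width code.

```python
# Expected bits per symbol under the variable-length codes, weighted by frequency.
freqs = {'A': 0.50, 'B': 0.25, 'C': 0.125, 'D': 0.125}
lengths = {'A': 1, 'B': 2, 'C': 3, 'D': 3}   # codes 0, 10, 110, 111
avg = sum(freqs[s] * lengths[s] for s in freqs)
print(avg)  # 1.75, an average saving of 0.25 bits per symbol
```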


For example, if you use letters as symbols and have details of the frequency of occurrence of those letters in typical strings, then you could just encode each letter with a fixed number of bits, such as in ASCII codes. You can do better than this by encoding more frequently occurring letters such as e and a with smaller bit strings, and less frequently occurring letters such as q and x with longer bit strings.
The slightly humanized procedure for generating Huffman codes for these symbols looks a little like a tree. You start with a vertex and branch out twice from it, labeling the top branch "0" and the bottom branch "1". Each time you make a "0" branch you label the vertex at the end of it with the most frequent symbol on your list. Each time you make a "1" branch you treat the new vertex at the end of it like the previous vertex, until you only have one symbol left, when you label the new "1" vertex with that symbol. The tree for the symbols above looks like this:
<pre> A
0/
/
. B
\ 0/
1\ /
. C
\ 0/
1\ /
.
\
1\
D</pre>
The Huffman codes (sometimes called "codewords") can be read off this tree by going to each symbol starting at the original vertex and appending the branch names:
<pre>A: 0
B: 10
C: 110
D: 111</pre>
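These codewords are prefix-free: no codeword is a prefix of another, so a bit string can be decoded greedily. A minimal decoding sketch (in Python, not part of the original page):

```python
# Decode a bit string using the prefix-free table above.
codes = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
decode_map = {v: k for k, v in codes.items()}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in decode_map:       # a complete codeword has been read
            out.append(decode_map[buf])
            buf = ''
    return ''.join(out)

encoded = ''.join(codes[c] for c in 'ABCD')  # '010110111'
print(decode(encoded))                       # 'ABCD'
```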


Any string of letters will be encoded as a string of bits that are no longer of the same length per letter. To successfully decode such a string, the smaller codes assigned to letters such as 'e' cannot occur as a prefix in the larger codes such as that for 'x'.
:If you were to assign a code 01 for 'e' and code 011 for 'x', then if the bits to decode started as 011... you would not know if you should decode an 'e' or an 'x'.
Write a program which reads in symbols and frequencies in pairs separated by whitespace and generates Huffman codes for each symbol. Test your program with the following line of letter frequencies as input:
<pre>"B 25 C 2.5 D 12.5 A 5 \n"</pre>
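The ambiguity of non-prefix-free codes can be demonstrated with a short sketch (in Python; 01 and 011 are the hypothetical assignments for 'e' and 'x' discussed above):

```python
# With 'e' -> 01 and 'x' -> 011, the incoming bits 011 match 'x' outright,
# but they also begin with 01 ('e'), so a decoder cannot choose without lookahead.
codes = {'e': '01', 'x': '011'}
bits = '011'
matches = [s for s, c in codes.items() if bits.startswith(c)]
print(matches)  # both symbols match -> the stream is ambiguous
```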


The Huffman coding scheme takes each symbol and its weight (or frequency of occurrence), and generates proper encodings for each symbol taking account of the weights of each symbol, so that higher-weighted symbols have fewer bits in their encoding. (See the [[wp:Huffman_coding|WP article]] for more information).
A Huffman encoding can be computed by first creating a tree of nodes:

[[Image:Huffman_coding_example.jpg|right|250px]]
# Create a leaf node for each symbol and add it to the priority queue.
# While there is more than one node in the queue:
## Remove the node of highest priority (lowest probability) twice to get two nodes.
## Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
## Add the new node to the queue.
# The remaining node is the root node and the tree is complete.

Traverse the constructed binary tree from root to leaves, assigning and accumulating a '0' for one branch and a '1' for the other at each node. The accumulated zeroes and ones at each leaf constitute a Huffman encoding for those symbols and weights:

<big>Using the characters and their frequency from the string ''"this is an example for huffman encoding"''; create a program to generate a Huffman encoding for each character as a table.</big>

=={{header|C}}==
This code lacks many of the needed checks, especially for memory allocation.

<lang c>#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BYTES 256

long freqs[BYTES];

struct huffcode {
int nbits;
int code;
};
typedef struct huffcode huffcode_t;

struct huffheap {
int *h;
int n, s, cs;
long *f;
};
typedef struct huffheap heap_t;

/* heap handling funcs */
static heap_t *_heap_create(int s, long *f)
{
heap_t *h;
h = malloc(sizeof(heap_t));
h->h = malloc(sizeof(int)*s);
h->s = h->cs = s;
h->n = 0;
h->f = f;
return h;
}

static void _heap_destroy(heap_t *heap)
{
free(heap->h);
free(heap);
}

#define swap_(I,J) do { int t_; t_ = a[(I)]; \
a[(I)] = a[(J)]; a[(J)] = t_; } while(0)
static void _heap_sort(heap_t *heap)
{
int i=1, j=2; /* gnome sort */
int *a = heap->h;

while(i < heap->n) { /* smaller values are kept at the end */
if ( heap->f[a[i-1]] >= heap->f[a[i]] ) {
i = j; j++;
} else {
swap_(i-1, i);
i--;
i = (i==0) ? 1 : i;
}
}
}
#undef swap_

static void _heap_add(heap_t *heap, int c)
{
if ( (heap->n + 1) > heap->s ) {
/* grow by cs more entries; realloc takes a size in bytes, not elements */
heap->h = realloc(heap->h, sizeof(int)*(heap->s + heap->cs));
heap->s += heap->cs;
}
heap->h[heap->n] = c;
heap->n++;
_heap_sort(heap);
}

static int _heap_remove(heap_t *heap)
{
if ( heap->n > 0 ) {
heap->n--;
return heap->h[heap->n];
}
return -1;
}

/* huffman code generator */
huffcode_t **create_huffman_codes(long *freqs)
{
huffcode_t **codes;
heap_t *heap;
long efreqs[BYTES*2];
int preds[BYTES*2];
int i, extf=BYTES;
int r1, r2;

memcpy(efreqs, freqs, sizeof(long)*BYTES);
memset(&efreqs[BYTES], 0, sizeof(long)*BYTES);

heap = _heap_create(BYTES*2, efreqs);
if ( heap == NULL ) return NULL;

for(i=0; i < BYTES; i++) if ( efreqs[i] > 0 ) _heap_add(heap, i);

while( heap->n > 1 )
{
r1 = _heap_remove(heap);
r2 = _heap_remove(heap);
efreqs[extf] = efreqs[r1] + efreqs[r2];
_heap_add(heap, extf);
preds[r1] = extf;
preds[r2] = -extf;
extf++;
}
r1 = _heap_remove(heap);
preds[r1] = r1;
_heap_destroy(heap);

codes = malloc(sizeof(void *)*BYTES);

int bc, bn, ix;
for(i=0; i < BYTES; i++) {
bc=0; bn=0;
if ( efreqs[i] == 0 ) { codes[i] = NULL; continue; }
ix = i;
while( abs(preds[ix]) != ix ) {
bc |= ((preds[ix] >= 0) ? 1 : 0 ) << bn;
ix = abs(preds[ix]);
bn++;
}
codes[i] = malloc(sizeof(huffcode_t));
codes[i]->nbits = bn;
codes[i]->code = bc;
}
return codes;
}

void free_huffman_codes(huffcode_t **c)
{
int i;

for(i=0; i < BYTES; i++) if (c[i] != NULL) free(c[i]);
free(c);
}

#define MAXBITSPERCODE 100

void inttobits(int c, int n, char *s)
{
s[n] = 0;
while(n > 0) {
s[n-1] = (c%2) + '0';
c >>= 1; n--;
}
}

char *test = "this is an example for huffman encoding";

int main()
{
huffcode_t **r;
int i;
char strbit[MAXBITSPERCODE];
char *p;

memset(freqs, 0, sizeof(long)*BYTES);

p = test;
while(*p != 0) freqs[*p++]++;

r = create_huffman_codes(freqs);

for(i=0; i < BYTES; i++) {
if ( r[i] != 0 ) {
inttobits(r[i]->code, r[i]->nbits, strbit);
printf("%c (%d) %s\n", i, r[i]->code, strbit);
}
}

free_huffman_codes(r);

return 0;
}</lang>

=={{header|Java}}==
{{works with|Java|1.5+}}
First a Letter class to keep things organized:
<lang java5>public class Letter implements Comparable<Letter>{
public Double freq;
public char letter;
private String huff = "";
public Letter(char l, double f){
letter = l;
freq = f;
}

//For easy sorting
public int compareTo(Letter o){
if(freq.equals(o.freq)){ //compare values, not boxed references
return ((Character)letter).compareTo(o.letter);
}
//we want higher frequencies first so add a minus
return -freq.compareTo(o.freq);
}

public String getHuff(){
return huff;
}

public void setHuff(String huff){
this.huff= huff;
}
public String toString(){
return letter + ": " + huff;
}
}</lang>
Now the class that actually does the work:
<lang java5>import java.util.Collections;
import java.util.LinkedList;
import java.util.Scanner;

public class GenHuffs{

public static void main(String[] args){
LinkedList<Letter> lets = new LinkedList<Letter>();
Scanner in = new Scanner(System.in);
while(in.hasNext()){
char symbol = in.next().charAt(0);
double freq = in.nextDouble();
Letter newLet = new Letter(symbol, freq);
lets.add(newLet);
}
Collections.sort(lets);
genHuffs(lets);
System.out.println(lets);
}
/**
* Sets the most frequent symbol to codeword 0, next most frequent to 10,
* continues prepending 1's until the last symbol which gets all 1's
*
* @param letters a list of symbols sorted by frequency
*/
public static void genHuffs(LinkedList<Letter> letters){
String currHuff = "0";
String nextHuffPre = "1";
for(Letter l:letters){
if(l!=letters.getLast()){
l.setHuff(currHuff);
currHuff = nextHuffPre + "0";
nextHuffPre += "1";
}else{
l.setHuff(currHuff.replace("0", ""));
}
}
}

}</lang>

=={{header|Python}}==
<lang python>>>> readin = "B 25 C 2.5 D 12.5 A 5 \n"
>>> cleaned = readin.strip().split()
>>> sortedin = sorted((-float(percent), letter) for letter, percent in zip(cleaned[0::2], cleaned[1::2]))
>>> ordered_letters = (pair[1] for pair in sortedin)
>>> huff = [[letter, (1<<(i+1))-2] for i, letter in enumerate(ordered_letters)]
>>> huff[-1][1] >>= 1
>>> for h in huff:
	print "{0:s}: {1:b}".format(*h)

B: 0
D: 10
A: 110
C: 111
>>> </lang>
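The solution above hard-codes the unary-style codewords 0, 10, 110, ..., which is why the task is flagged as incorrect for a general set of frequencies. A true Huffman construction following the priority-queue steps in the task description could be sketched like this (not part of the original page):

```python
import heapq

def huffman_codes(weights):
    """weights: dict symbol -> frequency. Returns dict symbol -> code string."""
    # One leaf per symbol; the counter breaks ties so the heap never compares dicts.
    heap = [(w, i, {s: ''}) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two lowest-weight nodes
        w2, _, c2 = heapq.heappop(heap)
        # Merge into an internal node, prefixing '0' on one side and '1' on the other.
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, count, merged))
        count += 1
    return heap[0][2]

print(huffman_codes({'B': 25, 'C': 2.5, 'D': 12.5, 'A': 5}))
```

On the task's input this yields code lengths of 1 bit for B, 2 for D, and 3 each for A and C, matching the optimal lengths regardless of which branch gets '0' or '1'.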

Revision as of 19:02, 27 March 2009
