Huffman compression

Posted on 16 October 2012 in Techstuff

Huffman encoding is a way to compress data in a lossless way. The general idea of huffman encoding is to replace original characters with a certain code. The more occurences of a character the shorter the code. So for example "e" gets a short code like "011" while "Q" can have a longer code like "101110101".

Encoding

To encode the data first the codes must be calculated. To do this a so called huffman tree is created. In short this tree will contain each character as leafs, to get a code for a character one must follow a path from the root to the leaf containing a caracter. On the way trough the tree you write down a "0" when going left and a "1" when going right. So a code is actually a path through the tree. To give common characters a shorter code they are put near the root of the tree, so there is a short path from the root to the character and thus a shorter code to represent the character.

Creating the huffman tree

An example will help. Figure 1 displays in four steps how the Huffman tree is created for the string "abbcccdddd". The numbers in the nodes are their frequency. The letter D occurs the most times and is placed near the root of the tree so it gets a verry short code. The other less common characters are placed further down the tree.

Figure 1 - Building a Huffman tree for the string abbcccdddd

Steps to create the hufman tree:
[1] Count the frequency of each character.
[2] Create a node for each caracter with the character and its frequency, put them in a list.
[3] Sort the list on frequency from low to high.
[4] Remove the first 2 nodes from the list, create a new internal node with the 2 removed nodes as childs. Its frequency will be the sum of the 2 removed nodes. Put the new internal node in the front of the list.
[5] 1 node left in the list? Done, this is the root node. Else: go to step 3.

The codes:
a = 100
b = 101
c = 11
d = 0

Writing the compressed file. Now we can replace the original characters with their code and write to a binary file. The hard part when writing to a binary format is to deal with bytes, when we are at the end of a byte we must "hop" to the next byte, placing all bits in the right place. When the decoder wants to decode the data it needs the huffman tree to lookup what the codes mean. So the whole Huffman tree has to be part of the file. This can be done in a number of ways.

Decoding the data

To decode the data first the tree must be recreated from the file. Next we must follow the path indicated by the 1s and 0s from the encoded data. Start at the root, follow the path until a leaf with the caracter is reached, then return to the root.

Source code download

Below is a link where the source code in C# is available. It contains Encoder and Decoder classes. Its not really a application to run, but the classes work and you can look trough the source code if you like.

Links