## The conceptAdaptive arithmetic coding is slightly customized switch to a different base. If we consider two numbers 136 and 24 expressed in octal base and want to convert them into decimal base we apply formula:1*8 ^{2} + 3*8^{1} + 6*8^{0} = 942*8 ^{1} + 4*8^{0} = 20Same technique applies to longer number 1346132. But if number contains only these symbols 1,2,3,4,6, there is no 0,5 or 7, we can make it shorter. In case some other context properties are available such as number 3 always followed by 2 or 4, it is possible to utilize this property either. For explanation we consider two conditional groups of symbols 1,3,6 and 2,4. We call first group blue and second group red. Now we introduce new detail. We consider distance to the next symbol in sorted list and associate this distance to each symbol. For example, if 1 is followed by 3 in a sorted list 1,3,6 we associate distance 2 with symbol 1 and so on. The last symbol in the list gets distance to the base. We show all distances and symbols involved into our elementary experiment in the tables below
The classical arithmetic formula for converting of octal number 1346132 into base 10 number is follows 1*8 ^{6} + 3*8^{5} + 4*8^{4} + 6*8^{3} + 1*8^{2} + 3*8^{1} + 2*8^{0} = 379994
To turn it into sophisticated adaptive arithmetic coder we need to multiply every next term in expression by the distance associated with previous symbol in manner shown below 1*8 ^{6} + 2*[3*8^{5} + 3*[4*8^{4} + 4*[6*8^{3} + 2*[1*8^{2} + 2*[3*8^{1} + 3*[2*8^{0}] ] ] ] ] ] = 636736
This is only change we need comprehend to become guru in adaptive encoders. The result 636736 can be converted back to sequence in pretty much the same way as in a classical numbers theory. To identify first number we divide 636736 by 8 ^{6} and find interval where this value belong. 636736/8^{6} = 2.42. Since 2.42 is between 1 and
3 we identify fist symbol as 1 and when it is identified we remove it by subtracting 1*8^{6} from both parts and dividing
both parts by 2. If we continue in the same way we extract all symbols from number 636736.The computed value 636736 is not the shortest form of octal number 1346132. In order to make it shorter we need one more operation. The decoding may still go right if we choose number slightly larger 636800, for example. Being more precise we can freely choose the number from the limits LOW and HIGH, where LOW limit is already computed as 636736 and HIGH limit is HIGH = LOW + 2*3*4*2*2*3*2 = 637312, where numbers in a product are distances associated with correspondent symbol in the message. As we can see the convenient number is 637000. The selected number should be lower than HIGH but may be equal to LOW. The process of encoding and decoding is shown clearly in the tables below. The division performed for integer operands. We show red data and blue data. In decoding we need to identify number from the correct set either blue or red, so this is the matter of adaptive coding. Since encoder and decoder see the same context they can synchronously switch to different tables. In decoding we know that any symbol after 3 belong to a different set. Encoding table
Decoding table
Those who have programming experience can see that multiplication and division by 8 is only binary shift. The problem with long products is solved by approximate calculations. As you can see in every encoding step the resulted product is multiplied by new number so it is implemented as mantissa (usually 16 bit integer) and exponent. After every new multiplication we update mantissa and keep track on exponent that provides additional shift. In real compression algorithm the distances 2,3,4,2,2,3,2 that we called distances to the next number in sorted list are frequencies of occurrences. The base always may be chosen as power of 2. In case message contain all possible symbols of octal number 0,1,2,3,4,5,6,7 our generalized method turn into classical because all distances are equal to 1. Outstanding scientific issuesThe provided above example seams very simple but there are two scientific issues that left. First is a proof that shown technique really compress data to a theoretical limit defined by entropy and second is about convergence, that means theoretical proof that decoding will always work for, at least, precise data. First proof is provided by myself in anonymous Wikipedia article and second one will be considered below. Presume our alphabet contains 3 symbols following in order ( S), for
which we can define values called
frequencies and cumulative frequencies, which magnitudes may be shown graphically in the picture below
_{3}, S_{1}, S_{2}, …C can be any non-negative number, _{1}
C
where _{2} = C_{1} + f_{1}, C _{3} = C_{2} + f_{2}, B = C _{3} + f_{3}, B is the base. If we limit our sequence to 3 symbols only, the LOW and HIGH limits are calculated as follows
LOW = B^{2} *C_{3} +B*f_{3}*C_{1} + f_{3}*f_{1}*C_{2} HIGH = LOW + f _{3}*f_{1}*f_{2} =B^{2} *C_{3} +B*f_{3}*C_{1} + f_{3}*f_{1}*(C_{2} + f_{2})
Since our selected result must be in range [LOW, HIGH) we may introduce variable g, that takes values from interval
[0, f, and present result as _{2})
RESULT = B
^{2} *C_{3} +B*f_{3}*C_{1} + f_{3}*f_{1}*(C_{2} + g)
We need to poof that for any g from [0, f first step in decoding surely identifies first symbol as
_{2})S based on _{3}C. Well, that is elementary, if we divide _{3}RESULT by
B the computed value will be between ^{2}C and _{3}C +
_{3}F because fraction
_{3}
[B*C is less than 1.
_{1} + f_{1}*(C_{2}+g) ] / B^{2} Why it is less than 1? Because C is less than _{1} + f_{1}*(C_{2}+g) / B B and, that, in turn,
because (C is less than 1 for any _{2}+g) / B g from defined interval.
When C is identified it can be excluded from result
in the way already shown. Recursively, this example can be applied for the message of any size.
That also makes obvious that _{3}RESULT for message of size n is less than B.
^{n}HistorySome engineers and researchers call this technique of arithmetic encoding as range encoding because it computes result by narrowing down the range starting from [0, n ^{n}). The only difference with
classical arithmetic encoding is that the latter computes result as narrowing range starting from interval [0,1).
Switching between these two methods can be done by simple division of all numbers involved in computation by
n^{n}. The new name for the method was assigned because many people wanted to present it as
formally different and on that reason patent free. All sources and tutorials for range encoding provide
link to the same article ,
that appeared in Internet near 2000 and is not readers friendly. The encoding technique shown in the
article is very close to numerical example shown above. The decoding technique, however, is so unclear
that is subject to interpretation. There is no proof that suggested compression compresses data to entropy
limit and, although it states that this technique can be applied as adaptive, this mentioned adaptive
technique is not shown in a way that it could be reproduced. The article does not reduce
the raw idea, poorly presented in it, to a practice and on that reason could not be considered as prior
art. We may consider as prior art the big number of successful implementations of range coders written
and published near year 2000 where this article was mentioned to back up the concept. On the reason that integer concept was unclear, it slipped out the mind of many researchers and engineers, who were filing patents between 1980 and 2000. In all patents filed during that period arithmetic encoding is defined as computation of proper fraction, which may open debates of novelty. Since so many people may be benefited from this hidden out of the view but public material I decided to conduct my personal investigation of its authenticity. The red flag was not only the negligence in explanation but the fact that Glen Langdon researcher for IBM cited this article in An Introduction to Arithmetic Coding, IBM J. RES. DEVELOP. VOL. 28, No 2, March 1984 without ever mentioning integer concept and kept filing patents explaining the concept of arithmetic coding as computing of growing on every step proper fraction. No matter how it all may look like the article appeared to be authentic and G. Nigel N. Martin is a real person who confirmed for me personally that he conducted this research back in 1979. It is interesting that in the above link with the list of his publications this work is not mentioned, although it is the one for which the name G.N.N. Martin is most frequently mentioned in Internet. ## Large alphabetsThe size of the alphabet can create a specific problem. In adaptive procedure the probabilities of occurrence for every symbol must be recalculated periodically. When size of the alphabet approach 10,000 the execution slows down. The answer to large alphabets is symbol replacement technique. In this method symbols are simply replaced by sequential indexes 0,1,2,... according to their frequencies of occurrence, where most frequent numbers have least indexes. The processing then performed bitwise where each bit from each symbol is sent to a different binary coder but the output streamed to one expression. For example, let say we have alphabet A,B,C,D and a message of 1000 symbols where symbols occur according to the following table
If we process first bit of every symbol by one binary coder we get 0.47 bits/symbol. If we do the same thing with second bit of every symbol we have 0.90 bits/symbol. So, we have 1.37 bits/symbol overall. If we calculate entropy by the classical formula we have about the same number 1.367 bits/symbol. It is possible to prove theoretically that when data are distributed normally the compressed data size is in accordance with Shannon entropy and when alphabets are large the data are almost always have normal distribution. ## Entropy reductionReplacement of symbols mentioned above can be made byte by byte, when we replace most frequent bytes by lower integers. In this case the entropy of data is changed but the size of data is the same, which is very convenient. In case this procedure is adaptive no any additional data have to be saved. To make this procedure adaptive symbol should be predicted before it is replaced and before the statistics is updated.## Adaptive sorting routine used in bytewise entropy reducerWhile bitwise prediction is clear and do not need any specific explanation other than working coding example, bytewise prediction is much more complicated. In bytewise prediction we predict next byte according to set of previous bits, so we have to collect statistics for every possible combination for selected context size. For example, if our selected context size is preceding 12 bits. All combinations cover binary fragments from 000000000000 to 111111111111. All 4096 combinations have to have statistic array of 256 symbols, where each element contains the number of occurrence of particular byte. Besides that we need resort these symbols frequently and use it for replacement of every byte by its position in the sorted list. In addition to what is said already we want to resort array after every new symbol in order to accumulate every new information. Although it sounds very complicated it is actually not and all that is implemented in very short (about 80 lines) quick routine that handles 1 Meg. over 0.3 sec. For this purpose I introduced special quick sort algorithm that may not be completely new and is based on the fact that in already sorted array any new symbol may require only two elements switch.In sequential symbol processing the desired action is to update sorted data on every step. The experiments with some quick sorting algorithms such as QSORT, HSORT show that although they are extremely fast their usage in periodic sorting of data slows down the execution significantly. From the other side after every single frequency increment we may need to switch only two elements in array of collected frequencies if array was sorted before it. For example, having sorted frequencies in array as follows 107, 100, 99 , 99,
99 , 23 we need to switch third
element with fifth
element in case fifth element is incremented (we counting positions
from 1). In case first, second or sixth element is incremented
no action is required. In addition to sorted array of frequencies we
need keep track of original position of every element because we have
to replace this original value reversibly. For successful decoding we
need also maintain addresses for quick
reversing of replaced value. After some experimenting I ended up with
rather simple routine that uses three arrays. It works fast but I
presume there is still room for further optimization. Since algorithm
is elementary there are high chances that
it was used before. I did not conduct the search and presume it is not
patented because of obviousness. Let us consider these three array in a table below
The elements of the frequency array are not switched in position but incremented on every occurrence (++f[i]). The top row elements are codes returned in encoding. For synchronization we get encoded value and output encoded value before updating the table. The index array at the bottom shows the position of elements for explanation. It is not physical memory allocation. The encoding is simple array look up. We replace 0 by 2, 1 by 1, 2 by 3 and 3 by 0 in encoding step. In decoding we replace data according to inverse array, we replace 0 by 3, 1 by 1, 2 by 0 and 3 by 2. In our look up structure if S[i] = k then I[k] = i. We can see also that we need change encoding/decoding arrays only in case value 2 occurs for our particular data. In other cases we only increment correspondent frequency. If value 2 occurs (i=2) we increment F[2] and have F = {150, 190, 151, 210}. Now we rearrange S[i] and I[i] to match new data distribution. Obviously we need to switch two elements in both arrays. The function UpdateTables quickly locate these two elements and switch them. The table
after update looks as follows:
The first and third row in last table look identical but that is a coincidence that exists for small tables. In processing of real data the predicted values are 4 to 8 bit long and array sizes are between 16 and 256 accordingly. There is similarity between this method and Move-To-Front method. The difference is that in MTF every symbol is replaced by its position according to last occurrence while in suggested method the position depends on frequency. The programming implementation includes also forgetting mechanism, where frequencies collected long ago are forgotten and latest occurrence have higher influence. On that reason suggested method is considered as generalization of MTF. ## Experimental resultsThe file set for testing is taken from Maximum Compression. The compression ratio is shown relative to original data.
## ConclusionRunCoder1 is streaming adaptive entropy coder that processes bytes one-by-one while reading. The code is about 600 lines and it is supposed to be used as part of archiver not as whole and only part of it. The expected resusult is achieving compression near other context encoder, written by M.Smirnov. RunCoder15 is improved version of RunCoder1. The speed is significantly faster because data is read and written by large chunks not byte by byte. The published version process data in memory. RunCoder15 outputs data in format independently if big endian processor or little endian processor was used.Andrew Polar, Feb. 2009. Last modification, July. 2010. |